IN THE UNITED STATES PATENT AND TRADEMARK OFFICE
F. Herz et al
DESIRABLE OBJECTS
Examiner: Huynh-Ba
CERTIFICATE OF MAILING UNDER 37 CFR 1.10

I hereby certify that this paper (along with any paper referred to as being attached or enclosed) is being deposited with the
United States Postal Service on the date shown below with sufficient postage as Express Mail, Express Mail Label No.
EL055940569US in an envelope addressed to Box AF, Assistant Commissioner for Patents, Washington, D.C. 20231.



Date: [illegible]

James M. Graziano



RULE 131 AFFIDAVIT 



The Assistant Commissioner 

for Patents 
Washington, D.C. 20231 

Dear Sir: 
We, Frederick S. M. Herz, a resident of Davis, State of West Virginia;
Jason M. Eisner a resident of Philadelphia, State of Pennsylvania; Steven L. 
Salzberg a resident of Baltimore, State of Maryland; Jonathan M. Smith a resident 
of Princeton, State of New Jersey; being duly sworn, depose and say: 

That we are co-applicants named in the above-titled patent application; 

That prior to March 29, 1995, we jointly conceived and subsequently jointly 
constructively reduced to practice the invention disclosed in the above-identified 
patent application; 

That there are no documented records providing the exact date of 
conception of the above-identified patent application; 

That the invention described in the above-identified patent application was 
subsequently constructively reduced to practice, as shown in the attached 
documents Exhibit A and Exhibit B, which documents were produced prior to March
29, 1995;

That the attached Exhibits A and B are reproductions of the original records 
(with dates blacked out) referred to in this Affidavit; 

That the attached Exhibit A comprises a facsimile copy of a draft disclosure 
document that we created prior to March 29, 1995 and transmitted to James M. 
Graziano, attorney of record in the above-identified patent application, via facsimile 
prior to March 29, 1995; 

That the attached Exhibit B comprises a draft of the above-identified patent 
application, produced by the above-mentioned James M. Graziano and transmitted 
to us prior to March 29, 1995; 

That both Exhibit A and Exhibit B describe the basic concept of the invention 
claimed in the above-identified patent application, that is the concept of providing 
a user with access to electronically stored target objects via the automatically 
generated target profiles for the target objects and user target profile interest 
summaries; 

That this novel structure is disclosed in both the attached Exhibit A and
Exhibit B, as well as specifically claimed in the above-identified patent application,
as in for example claim 1 as it presently stands in prosecution:

1. A method for providing a user with access to selected
ones of a plurality of target objects that are accessible via an
electronic storage media, where said users are connected via user
terminals and bidirectional data communication connections to a
target server system which includes said electronic storage media,
said method comprising the steps of:

automatically generating target profiles for target objects that
are stored in said electronic storage media, each of said target
profiles being generated from the contents of an associated one of
said target objects and their associated sets of target object
characteristics;

automatically generating at least one user target profile
interest summary for a user at a user terminal, each said user target
profile interest summary being generated from target profiles
associated with ones of said target objects accessed by said user;
and

enabling access to said plurality of target objects stored on
said electronic storage media by users via said target profiles and
said at least one user target profile interest summary.

That the originals of Exhibit A and Exhibit B are in the possession of James 
M. Graziano, attorney of record in the above-identified patent application; and 

That all of these acts and record were made during the regular course of
business in the United States of America prior to March 29, 1995. 

We hereby declare that all statements made herein of our own knowledge 
are true and that all statements made on information and belief are believed to be 
true; and further that these statements were made with the knowledge that willful 
false statements and the like are punishable by fine or imprisonment, or both, 
under 18 U.S.C. §1001, and that such willful false statements may jeopardize the
validity of the application or any patent issued thereon. 




Frederick S. M. Herz 



Subscribed and sworn to before me this [illegible] day of [illegible], 1998.

My commission expires: [illegible]

NOTARY PUBLIC
STATE OF WEST VIRGINIA
REBECCA WHITE
[address illegible]

(SEAL)



Jason M. Eisner 



Subscribed and sworn to before me this ____ day of ________, 1998.



My commission expires: 



Notary Public 



(SEAL) 



Frederick S. M. Herz 



Subscribed and sworn to before me this 



day of 



1998. 



My commission expires: 



Notary Public



(SEAL) 




Subscribed and sworn to before me this 




day of 




NOTARIAL SEAL
KIMBERLY A. CONTE, Notary Public
Lower Merion Twp., Montgomery County
My Commission Expires August 13, 2001

My commission expires:




(SEAL) 
Steven L. Salzberg

Subscribed and sworn to before me this [illegible] day of [illegible], 199[illegible].
My commission expires: 




Notary Public 
(SEAL) 
Draft

Notes for a patent on a
Method for Customized Information Retrieval Using Personal Profiles

work by: Fred Herz, Mitch Marcus, Jonathan Smith, Eric Brill, Jason Eisner, Lyle Ungar

Field of the Invention 

The present invention relates to a method for automatically determining which news articles
a user is most likely to wish to read from an on-line news source such as the AP news wire
or Reuters, and allowing the user to select from those articles. More particularly, the
method involves constructing a "profile" for each article based on the frequencies with
which each word appears in each article relative to its overall frequency of use in all
articles. User profiles are then constructed by observing which articles the users read.
Because people have multiple interests, multiple profiles are kept for each user,
corresponding to multiple topics of interest. Each user is presented with those articles
whose profiles most closely match his or her profiles. User profiles are automatically
updated on a continuing basis to reflect each viewer's changing interests. Alternatively,
documents are grouped into clusters and menus are automatically generated to allow users
to navigate the clusters and locate documents of interest. Documents may be news,
electronic mail, or product descriptions, and profiles may include data from structured
databases as well as free text. Similar methods can be used to allow users to find people
with similar profiles and hence interests for the purpose of commerce or pleasurable
discussion.



Description of the prior art 

Researchers in the field of information retrieval have devoted considerable effort to finding
efficient and accurate methods of allowing users to select documents of interest from a large
set of documents. The most widely used methods are based on keyword matching: the
user specifies a set of words which s/he thinks will be in a document and the computer
retrieves all documents which contain those words. Such methods are fast, but are
notoriously unreliable, as users may not think of the right keywords. Use of logical
combinations of words and of wild cards (e.g. [gorilla OR chimp*] AND [adolescent* OR
teen*]) helps reduce, but does not solve, this problem.

Starting in the 1960s, an alternate approach was proposed: users were presented with a
document and asked if it was what they wanted, or how close it was. Each document was
described by a profile: a list of the words in the document or, in more advanced systems, a
list of word frequencies in the document. One document is said to be similar to another if
the distance between their profiles is small. Similarity of document profiles can be used in
document retrieval. A user searching for information about a certain subject can write a
description of what they are looking for. The computer then retrieves documents with
profiles similar to that of the request. These requests can then be refined using "relevance
feedback", where the user rates the documents retrieved as to how close they are to what is
being sought. The computer then uses this information to refine the target profile, and the
process is repeated until the user either finds enough documents or tires of the search.

Traditional information retrieval uses relevance feedback for retrieving one or more
documents in a single area. (It presents a sequence of documents and uses distance from
desired or undesired documents to determine what to present next.) In many applications
such as the reading of newspapers, users are interested in multiple topic areas, and maintain
their interests over time. The method proposed in this patent differs from standard
information retrieval methods in that it keeps multiple profiles of each user, which are
maintained and refined over time. The use of such profiles is also extended to include
automatic generation of menus, and to the matching of people with similar profiles.

DISCUSS OTHER PATENTS, PAPERS 

This patent speaks entirely of counting words in a document, but the methods proposed can
be trivially extended to counting n-grams of letters. (N-grams are just sequences of N
contiguous letters in a document. For example, this sentence contains a sequence of
5-grams which starts "For e", "or ex", "r exa", " exam", "examp" ...) This patent could build
upon such methods as, for example, in patent # (Marc D.), which describes how such
n-gram methods can be used for information retrieval.
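
The letter n-gram idea can be illustrated with a minimal sketch. The function name `letter_ngrams` is hypothetical and not part of the draft; the sketch simply slides a window of N characters across the text, counting spaces as characters, as the draft's own example does.

```python
def letter_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n contiguous characters in text, in order."""
    # A text shorter than n characters yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    # Reproduces the start of the sequence quoted in the draft.
    print(letter_ngrams("For example", 5)[:5])
```

The resulting n-grams could then be counted in place of words when a profile is built.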

A number of researchers have looked at methods for selecting items of most interest to users.
Pattie Maes and coworkers at MIT produced the Ringo system for recommending musical
selections. Their system requires active feedback from the users (users must specify how
much they like or dislike each musical selection, as opposed to this system which
determines user interests by monitoring their behavior). Also, Ringo keeps a complete list
of users' ratings of music selections and makes recommendations by finding which
selections were liked by multiple people. Unlike the system described below, Ringo does
not take advantage of any available descriptions of the music, either structured descriptions
in a data base or free text such as reviews.

DISCUSS OTHER Clustering work 

A couple of other research groups have looked at the automatic generation and labeling of
clusters of documents for the purpose of browsing through the documents. ? and Marc ?
propose a method where documents can be displayed on a two dimensional plot
with distances between the documents indicating how dissimilar they are. They generate
cluster labels by [...]. A group at Xerox PARC has developed a method they call
"scatter/gather" in which... ?

The method proposed below differs in that it uses a hierarchical clustering method to 
generate a menu tree and then further modifies this tree based on data collected as users 
access data through the menus generated from the tree. 

Overview of patent

This patent proposes a fundamental methodology for matching people and objects by
calculating, using and automatically updating profiles describing the users' interests and the
objects' characteristics. The "objects" may be text documents, purchasable items, or even
other people. Examples include a person looking for a newspaper story of potential
interest, a movie to watch, an item to buy, or another person to correspond with. In all
cases, the method is based on determining the similarity between a profile for the object and
a profile for the user. Object profiles may include some or all of the following: (1)
[illegible in the source], (2) structured data (e.g. the author of the newspaper article, the
actors and rating given to a movie, the price and manufacturer of a product, or the name
and phone number of the person placing an advertisement) and (3) a list of the persons who
have accessed (e.g. read or purchased) the object. The structured data may, in particular,
contain information about the quality of the object, such as measures of its popularity (how
often it is accessed) or of user satisfaction (number of complaints received).

The similarity of profiles describing objects and similar profiles
describing a user's interests can be applied in two basic ways: filtering and browsing.

Filtering is useful when large numbers of documents are being received by a user who [...]
documents which the user is most likely to wish to [...]. [...] below improves its filtering
accuracy over time [...].

Browsing provides an alternate method of selecting a small subset of a large number of
[...] (we use the word "document" below, but please keep in mind [...] descriptions of items
being sold or of different people and their interests). Documents are organized so that users
can navigate among the documents by [...] of documents to smaller, more specific [...] to be
grouped into clusters and the clusters to be grouped into larger clusters. These hierarchies
of clusters then [...] for menuing navigational systems to allow rapid searching of large
numbers [...].



The following [...] profiles for document retrieval: [...]

[...] automatically building and storing menuing systems for browsing [...] between the
claims describes a computer architecture which permits the [...]



Preferred embodiment - news clipping service 

[...] is assigned a profile based on the relative frequencies of occurrence of the
words in the document. Users are also assigned profiles by a method described below.
As new articles are received, their profiles are compared to each user's profiles, and the
closest (most similar) [...] the user [...] the articles [...] (the number of screens of data
and the number of minutes spent reading), and adjusts the user's profiles to more closely
match what was read.

See figure 1.

1. retrieve new documents from document source
(e.g. news from AP news wire)

2. calculate document profiles

3. compare document profiles to user profiles: select closest documents

4. present list of documents to user

5. monitor which documents are read

6. update user profiles

Fig. 1 Method for selecting documents of most interest to computer users

Each of the steps outlined in Figure 1 is described in detail below. The details of the method
require selecting means of calculating profiles, [...] profiles, and of [...].

1 . Retrieve new documents from document source. 

Documents are available on-line from a wide variety of sources. In the preferred
embodiment, one would use the current day's news as supplied e.g. by the AP or Reuters
[...] electronic bulletin boards, and can be used as part of a system for
scanning and organizing electronic mail ("email").

2. Calculate document profiles. 

[...] for each document which indicates the relative frequencies of word
occurrences in this document relative to other comparable documents. Many measures can
[...] profile [...] TF/IDF measure: the term (word) frequency (TF) times the inverse
document frequency (IDF), where the term frequency is the number of occurrences of a word
in a document divided by the total number of words [...] of the fraction of documents in
which the word occurs. (Other measures can be used but generally do not work as well.)
Words which occur frequently [...] calculated document similarities without indicating
document topics and so are typically removed from feature vectors. Those skilled in the art
will be familiar with the large literature on this subject; see, for example, Salton and
McGill (1983).
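
The profile calculation can be sketched as follows. This is a minimal illustration, not the draft's implementation: the function name `tfidf_profiles` and the `stop_words` parameter are hypothetical, and because the exact IDF definition is illegible in the source, the sketch assumes the common form log(N / number of documents containing the word).

```python
import math
from collections import Counter


def tfidf_profiles(docs: list[list[str]],
                   stop_words: frozenset[str] = frozenset()) -> list[dict[str, float]]:
    """Build one TF/IDF profile (word -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq: Counter[str] = Counter()
    for words in docs:
        doc_freq.update(set(words))

    profiles = []
    for words in docs:
        counts = Counter(words)
        total = len(words)
        profile = {}
        for word, count in counts.items():
            if word in stop_words:
                continue  # frequent words that do not indicate topic are dropped
            tf = count / total  # occurrences divided by document length
            idf = math.log(n_docs / doc_freq[word])  # assumed IDF form
            profile[word] = tf * idf
        profiles.append(profile)
    return profiles


if __name__ == "__main__":
    docs = [
        "the gorilla news".split(),
        "the chimp news news".split(),
        "the stock market news".split(),
    ]
    for profile in tfidf_profiles(docs, stop_words=frozenset({"the"})):
        print(profile)
```

Under this assumed IDF, a word that appears in every document receives weight zero, which matches the draft's remark that very frequent words say little about a document's topic.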

Other methods of calculating relative word frequencies can equally well be used, such as
semantic or probabilistic models (refs: [...]). For many applications it [...] actually used
in the document with a list of synonyms or other words which tend to co-occur with each
word in the document.

[...] together similar words can be used to improve the
performance of the profiling and menuing systems. Words which are similar can be treated
as if they were the same word, or documents containing a word can be augmented with
[...] of the word or with other words which tend to co-occur with the word. [...]




"staples" 


3. Compare document profiles to user profiles; select closest documents. 

A set of profiles is stored for each user. These are initially selected by using the profiles of
documents that the user indicates are representative of his interest, by asking the user for
key words, or by using a standard set of profiles [...].

All new documents [...] of similarity between documents and queries:

$$\mathrm{cosine}(\mathrm{Profile}_i, \mathrm{Profile}_j) = \frac{\sum_k \mathrm{term}_{ik}\,\mathrm{term}_{jk}}{\sqrt{\sum_k \mathrm{term}_{ik}^2 \; \sum_k \mathrm{term}_{jk}^2}}$$

where the sum is over all the terms (TF/IDFs for words), and $\mathrm{term}_{ik}$ is the
[...]. [...] documents are most orthogonal to each other [...].
Other similarity metrics such as the dice measure can also be used.
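
The comparison step can be sketched with the cosine measure given above. The names `cosine` and `closest_documents` are hypothetical, and one choice here is an assumption rather than something the draft states: since each user has several profiles, a document is scored by its best match against any one of them.

```python
import math

Profile = dict[str, float]  # term -> TF/IDF weight


def cosine(p: Profile, q: Profile) -> float:
    """Cosine similarity between two sparse profiles (0.0 if either is empty)."""
    dot = sum(w * q.get(term, 0.0) for term, w in p.items())
    norm = math.sqrt(sum(w * w for w in p.values()) * sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0


def closest_documents(user_profiles: list[Profile],
                      doc_profiles: dict[str, Profile],
                      top_n: int = 5) -> list[str]:
    """Return the ids of the top_n documents closest to any of the user's profiles."""
    scores = {
        doc_id: max((cosine(doc, up) for up in user_profiles), default=0.0)
        for doc_id, doc in doc_profiles.items()
    }
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


if __name__ == "__main__":
    user = [{"ape": 1.0}, {"stock": 1.0}]
    docs = {
        "d1": {"ape": 0.9, "zoo": 0.1},
        "d2": {"weather": 1.0},
        "d3": {"stock": 0.5, "bank": 0.5},
    }
    print(closest_documents(user, docs, top_n=2))
```

Taking the maximum over the user's profiles lets an article rank highly if it fits any one of the user's interests, which is one plausible reading of keeping multiple profiles per user.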

4. Present list of documents to user. 

A list of [...] documents is presented to the user, who can then select any [...].

5. Monitor which documents are read. 

The system monitors which documents the user reads, keeping track of how many pages of
text are read, how much time is spent viewing the document, and whether all pages of the
document were viewed. This information can be combined to [...]
the document. Although the exact details depend on the length and nature of the
documents being searched, a typical formula might be

measure of document attractiveness =

0.2 if the second page is accessed +

0.2 if all pages are accessed +

0.2 if more than 30 seconds was spent on the document +

0.2 if more than one minute was spent on the document +

0.2 if the minutes spent in the document are greater than half the number of pages
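
The attractiveness formula can be sketched as a simple count of satisfied criteria. The function name `attractiveness` and its parameters are hypothetical. Two details are assumptions because the source is partly illegible at this point: the fourth criterion is read as "more than one minute spent", and the last criterion is given the same 0.2 weight as the others. "Second page accessed" is modeled as at least two pages read.

```python
def attractiveness(pages_read: int, total_pages: int, seconds_spent: float) -> float:
    """Score in [0, 1]: 0.2 for each reading criterion the user satisfied."""
    criteria = [
        pages_read >= 2,                       # the second page was accessed
        pages_read >= total_pages,             # all pages were accessed
        seconds_spent > 30,                    # more than 30 seconds spent
        seconds_spent > 60,                    # more than one minute (assumed reading)
        seconds_spent / 60 > total_pages / 2,  # minutes spent exceed half the page count
    ]
    return 0.2 * sum(criteria)


if __name__ == "__main__":
    # A four-page article read in full over two and a half minutes.
    print(attractiveness(pages_read=4, total_pages=4, seconds_spent=150))
    # The same article abandoned on the first page after ten seconds.
    print(attractiveness(pages_read=1, total_pages=4, seconds_spent=10))
```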

6. Update user profiles. 






Updating of user profiles is done using the method described in patent # ([...]).
[...] the profile closest to the document read is shifted slightly towards the
document's profile.
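
The update step can be sketched as moving the nearest user profile a small step toward the profile of the document that was read. This is only an illustration: the referenced patent's actual method is not reproduced in the draft, and the function name `update_profiles`, the linear interpolation, and the `rate` parameter are all assumptions.

```python
import math

Profile = dict[str, float]  # term -> weight


def cosine(p: Profile, q: Profile) -> float:
    """Cosine similarity between two sparse profiles (0.0 if either is empty)."""
    dot = sum(w * q.get(term, 0.0) for term, w in p.items())
    norm = math.sqrt(sum(w * w for w in p.values()) * sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0


def update_profiles(user_profiles: list[Profile], doc: Profile,
                    rate: float = 0.1) -> list[Profile]:
    """Return new profiles with the one closest to doc shifted toward doc."""
    if not user_profiles:
        # Assumption: a user with no profiles is seeded with the document's profile.
        return [dict(doc)]
    best = max(range(len(user_profiles)), key=lambda i: cosine(user_profiles[i], doc))
    old = user_profiles[best]
    # Linear interpolation: rate=0 leaves the profile unchanged, rate=1 replaces it.
    moved = {
        term: (1 - rate) * old.get(term, 0.0) + rate * doc.get(term, 0.0)
        for term in old.keys() | doc.keys()
    }
    return [moved if i == best else dict(p) for i, p in enumerate(user_profiles)]


if __name__ == "__main__":
    profiles = [{"sports": 1.0}, {"finance": 1.0}]
    article = {"finance": 0.5, "stocks": 0.5}
    print(update_profiles(profiles, article, rate=0.5))
```

The step size could plausibly be scaled by a measure of how attractive the user found the document, so that a skimmed article moves the profile less than one read in full.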

INSERT 3 PARAGRAPH DESCRIPTION OF PATENT 

<add alternate embodiments and other considerations>

Preferred embodiment - Specialization for electronic mail (email)

FROM ERIC BRILL



Extension: Matching users for purchasing and virtual communities




[...] product descriptions and of products [...] augmented with structured database
information. [...]

[...] For single purchases, menus generated by clustering items can
be used. This is particularly important when a user may be shopping from items sold by a
large number of vendors, where vendors' descriptions and product [...].

[...] looking for specific items [...] useful when one is repeatedly seeking similar [...].

[...] attributes such as price, availability of colors or sizes [...]. These terms can be
included with relative word frequencies (TF/IDF measures) in the document profiles, as long
as care is taken to [...]

Preferred embodiment - virtual community

Computer users frequently join other users for discussions on computer bulletin boards or
news groups. In current practice, each bulletin board has a specified topic and users then
look through a long list of topics (typically hundreds) to find topics of interest. The users
then must select for themselves which of thousands of messages they find interesting from
among those posted to the selected bulletin boards and, if desired, post additional
discussion on the topics. The profiling system described above can also be used to allow
users to find other people with similar interests (i.e. to generate virtual communities of
people with similar interests). Profiles of each person can be formed by the news-clipping
method described above: a person's set of profiles is basically the centroids of the clusters
of the profiles of the bulletin board articles read by the user. Users can then be grouped
together based on the similarity of one or more of their profiles. Such groups of people
have been reading and writing about similar topics and in similar styles and so will
presumably share interests. New bulletin boards can be formed as sufficient numbers of
users with different interests accumulate.

The existence of thousands of Internet bulletin boards (also called news groups) and
countless more private bulletin board services (BBS's) demonstrates the very strong
interest among members of the electronic community in forums for the discussion
of ideas about almost any subject imaginable. Currently, bulletin
board creation proceeds in a haphazard form, usually instigated by a
single individual who decides that a topic is worthy of discussion.
There are protocols on the Internet for voting to determine whether a
news group should be created, but there is a large hierarchy of news
groups (all beginning with the prefix "alt") that do not follow this
protocol.

One can construct a browser for bulletin boards where each bulletin board is characterized
by one or more profiles, so that potential new bulletin board users can locate bulletin
boards by presenting the system with one or more messages typical of what they are
looking for. Profiles would be generated from the messages using the methods described
above, and then the user pointed to those bulletin boards most closely corresponding to the
[...]. Bulletin boards, like all objects with profiles, could also be clustered
and put into an automatically generated menu.

The above method will be useful if people with the right set of interests have already
formed a bulletin board or "chat group". However, because people have varied and
varying complex interests, it is desirable to automatically create groups of people ("virtual
communities") with common interests. The Virtual Community System (VCS) described
below is a network-based agent that seeks out users of a network with common interests,
dynamically creates bulletin boards or electronic mailing lists for those users, and
introduces them to each other electronically via e-mail.

The functions of VCS are as described below. These are general functions that could be
implemented on any network, ranging from an office network in a small company to the
World Wide Web or the Internet. The four main steps in the procedure are as described in
Figure 3:



1. Scanning postings to bulletin boards and clustering people who post similar messages.
2. Creating new bulletin boards
3. Announcing the bulletin boards
4. Enrolling people in the bulletin boards

Fig. 3 Method for creating virtual communities



Each of these steps is carried out as follows:

Scanning. VCS finds people with interests in [...] common who form a virtual community
[...] of bulletin boards that might be local to a single
organization, for example a large company, a law firm, or a university.

[...] based on similarity of their profiles [...] of common interests among the users [...]
cluster of sufficient size (generally 10-20 different [...] in different bulletin boards or
mailing lists), it will form a new [...].

[...] techniques for generating cluster labels) for the new mailing [...]. All such mailing
lists will be time-stamped, and if they are not used for a user-determined length of time,
VCS will delete them.

[...] can be set by the person installing VCS [...] all the messages in a cluster [...] of
time (probably a few [...]), it will [...] automatically using its clustering [...] posted to
the bulletin board.

[...] VCS informs all members of the new virtual [...] bulletin board (containing [...]) of
the basis of the community.

[...] voluntarily. VCS will send an e-mail
message directly to everyone who posted any text that served as the
basis for a new bulletin board it has created. As stated above, it
will do the same for new mailing lists. However, with bulletin
boards, it will simply inform users of the existence and name of the
bulletin board, and leave it to them to decide whether or not they
wish to read it.

Enrolling. Even after creation of a new Virtual Community, VCS 
will continue to scan existing mailing lists and bulletin boards for 
messages that belong to that community. Any new messages that appear 
will be cross-posted to the new community, and the person posting the 
message will be informed by VCS that he/she belongs to a new Virtual 
Community. The user can then decide whether or not to read the new
VCS bulletin board or, if the community is using a mailing list, to
subscribe to the mailing list.

With these facilities, VCS will provide automatic creation of virtual
communities in any local or wide-area network. The core technology underlying VCS is
creating a search and clustering mechanism that can find documents that are "similar" in that
[...] interests. This is precisely what was described above. One must be sure
that VCS does not bombard users with notices about communities in which they have no
real interest. On a very small network a human could be "in the loop," scanning proposed
virtual communities and perhaps even giving them names. But on larger networks the VCS
has to run in fully automatic mode, since it is likely to find a large number of virtual
communities.
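
The scanning and creation steps can be sketched as grouping users whose profiles are similar and keeping only groups that reach a minimum size. The draft does not specify the clustering algorithm VCS uses, so the greedy seed-based grouping below is a stand-in, and the names `find_communities`, `threshold` and `min_size` are hypothetical. The default `min_size` of 10 follows the draft's "generally 10-20" figure.

```python
import math

Profile = dict[str, float]  # term -> weight


def cosine(p: Profile, q: Profile) -> float:
    """Cosine similarity between two sparse profiles (0.0 if either is empty)."""
    dot = sum(w * q.get(term, 0.0) for term, w in p.items())
    norm = math.sqrt(sum(w * w for w in p.values()) * sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0


def find_communities(user_profiles: dict[str, Profile],
                     threshold: float = 0.8,
                     min_size: int = 10) -> list[list[str]]:
    """Greedily group users by similarity to each group's first member (its seed)."""
    groups: list[list[str]] = []
    for user, profile in user_profiles.items():
        for members in groups:
            seed_profile = user_profiles[members[0]]
            if cosine(profile, seed_profile) >= threshold:
                members.append(user)
                break
        else:
            groups.append([user])  # no similar group found: start a new one
    # Only groups large enough to justify a new bulletin board are kept.
    return [members for members in groups if len(members) >= min_size]


if __name__ == "__main__":
    users = {
        "ann": {"chess": 1.0},
        "bob": {"chess": 0.9, "go": 0.1},
        "cat": {"sailing": 1.0},
        "dan": {"chess": 1.0, "go": 0.2},
    }
    print(find_communities(users, threshold=0.8, min_size=3))
```

A fairly high similarity threshold combined with the minimum size is one simple way to limit the risk, noted in the draft, of notifying users about communities they have no real interest in.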

Preferred embodiment - Menu generation based on clustering and automatic 
cluster labeling 

A browsing system can be constructed for the retrieval of other documents such as
reference articles in an on-line library using the same methods underlying the news clipping
service. [...] that since all reference articles are potentially of interest (not
just recent ones, as is the case for news), and since users are often looking for
specific topics, the preferred interface includes a menuing system.

In order to allow users to rapidly locate a document from among a large set of documents,
[...] profile for each document using the TF/IDF word frequency measures
described above. Documents are then grouped into clusters using a hierarchical clustering
[...] clustering algorithm. Clusters are groups of documents which
[...] each other, i.e. close to each other in some metric such as the cosine measure.

[...] clusters produce a tree which divides the documents first into two large
clusters of roughly similar documents; then each of these clusters is in turn divided into two
or more smaller clusters, which in turn are each divided into yet smaller clusters until a
[...] is found consisting of a single document. This division of documents provides an
[...] to retrieve a document of interest: the user first chooses between the
[...] clusters and then selects the smaller clusters. This process is
repeated until the user comes to the lowest level in the tree, which are the documents
[...].


Hierarchical trees allow rapid selection of one document from a large set. In ten menu
selections from menus of ten items each, one can reach 10^10 = 10,000,000,000 (one hundred [...]

add description of figure:

1. calculate document profiles
2. cluster documents into a hierarchical cluster
3. generate labels for each cluster
4. generate menus from cluster structure and labels
5. optionally, monitor which documents are read and adjust menus

Fig. [number illegible] Method for selecting documents using menus



Label generation

A key requirement to successfully use the menu is the ability to automatically label the
clusters. Labels can be generated by selecting the document or documents closest to the
center of the cluster and then displaying either the document title, or the set of words in the
profile of the centroid of the document cluster which have the highest relative frequency
(TF/IDF). More informative labels can be generated by (1) removing redundant synonyms
from the labels either using a synonym dictionary or a morphological analyzer (e.g. "salt,
salted, salty, saltier...." should reduce to "salt") and (2) using terms which have the highest
discriminatory ability as labels, rather than those with the highest frequency. <explain
how this is done>

It is also useful to allow users to view the documents as needed in case the above 
descriptions are insufficient. 
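
The simplest labeling option, taking the highest-weighted words in the cluster centroid's profile, can be sketched as below. The function name `cluster_label` and the `n_terms` parameter are hypothetical. Synonym removal and discriminatory-term selection are not attempted here, since the draft leaves their details open.

```python
Profile = dict[str, float]  # term -> TF/IDF weight


def cluster_label(member_profiles: list[Profile], n_terms: int = 3) -> list[str]:
    """Return the n_terms highest-weighted words in the cluster's centroid profile."""
    terms = set().union(*member_profiles)
    centroid = {
        t: sum(p.get(t, 0.0) for p in member_profiles) / len(member_profiles)
        for t in terms
    }
    # Highest weight first; ties broken alphabetically for a stable label.
    ranked = sorted(centroid, key=lambda t: (-centroid[t], t))
    return ranked[:n_terms]


if __name__ == "__main__":
    cluster = [
        {"salt": 0.9, "pepper": 0.4},
        {"salt": 0.7, "cumin": 0.2},
    ]
    print(cluster_label(cluster, n_terms=2))
```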

4 Menu generation

Although most clustering methods generate binary branching trees, for human users it is
[...] to have [...] items in a menu. This is easily accomplished by displaying the four
grandchildren or eight great-grandchildren of a node in a cluster (see figure x).
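
Collapsing a binary tree into wider menus can be sketched by listing the descendants a fixed number of levels below a node. The function name `menu_items` is hypothetical, and the tree is assumed to be nested tuples with document ids at the leaves. A leaf reached before the requested depth is listed as a menu item itself.

```python
def menu_items(node, depth: int = 2) -> list:
    """Return the subtrees found `depth` levels below node (leaves are kept as-is)."""
    if depth == 0 or not isinstance(node, tuple):
        return [node]
    items = []
    for child in node:
        items.extend(menu_items(child, depth - 1))
    return items


if __name__ == "__main__":
    tree = ((("a", "b"), ("c", "d")), (("e", "f"), "g"))
    # depth=2 lists the grandchildren of the root: up to four menu entries.
    print(menu_items(tree, depth=2))
    # depth=3 lists the great-grandchildren: up to eight menu entries.
    print(menu_items(tree, depth=3))
```

Selecting one of the returned subtrees and calling the function again on it gives the next menu, so a binary tree can be browsed four or eight entries at a time.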

5 Menu updating

As users access documents using the menuing system, certain documents and certain
regions of the document profile space will be accessed more frequently than others. These
regions correspond to the users' interests. If, for example, a user frequently accessed
documents close to a, b, and d in figure 3(a) then the menu in 3(d) could be modified to
[...]

A more formal algorithm for this is to associate with each node of the tree a probability that
[...] will be located under that node. [...] number of documents, or (if data are available)
can be based on [...] accessing each document by all users. (I.e., the probability of access
of a cluster is the [...]




Figure 3 Menu generation from document clusters
(a) Document profiles [illegible] relative word frequencies; circles represent clusters.

[tree diagram illegible]

(b) Hierarchical cluster tree for documents in (a).

[diagram illegible]

(c) Cluster labels for tree in (b). Document titles, most distinguishing or
most frequent words could also be used for cluster labels.




(d) Collapsed menu tree from tree in (c) 
Alternate embodiments of the menuing system

Many variations and improvements on the above menuing system are possible. Different
clustering methods can be used, such as fuzzy clustering [illegible]
structured information as well as free text, the word frequencies can be supplemented with
other terms such as the manually assigned category of the article, the price of the object
[illegible]
document (e.g. number of stars given to a movie in a movie review). Similarity measures
(distance metrics) can be adjusted to reflect different stress laid by different users on
different terms or items.

The automatic menuing system described above can be used in parallel with manually
generated classifications of articles when such are available. 

[illegible] clusters either starting at the top of the tree and moving to more
[illegible]
against one of the clusters. After an initial cluster is located by finding the cluster center
most similar to the query, the user can move to adjacent clusters. It is generally less
efficient to start at the largest cluster and repeatedly select smaller subclusters than it is to
write a brief description of what one is looking for (e.g. an inexpensive microwave with
1000 watts power), and then to move to nearby clusters if the objects initially recommended
are not those desired. <explain further?>
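
Locating the initial cluster from a brief written description can be sketched as below. This is a hedged illustration only: it assumes that the query and each cluster center are represented as {term: weight} dictionaries and that similarity is measured by the cosine of the angle between them, which is one common choice and not necessarily the measure intended here. The names cosine and nearest_cluster and the sample data are hypothetical.

```python
import math


def cosine(p, q):
    """Cosine similarity between two {term: weight} profiles."""
    dot = sum(weight * q.get(term, 0.0) for term, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def nearest_cluster(query, centroids):
    """Return the name of the cluster whose center is most similar to the query."""
    return max(centroids, key=lambda name: cosine(query, centroids[name]))


# Hypothetical cluster centers and a query profile built from the description
# "an inexpensive microwave with 1000 watts power".
centroids = {
    "microwaves": {"microwave": 1.0, "watts": 0.5},
    "toasters": {"toaster": 1.0, "watts": 0.3},
}
query = {"inexpensive": 1.0, "microwave": 1.0, "watts": 1.0}
print(nearest_cluster(query, centroids))
```

Moving to adjacent clusters afterwards could reuse the same similarity measure between cluster centers.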

Preferred embodiment - architecture

These retrieval methods can also be used on documents stored at multiple locations, such as
files at many internet sites. The methods described above still apply, but with minor
differences in implementation: software agents profile documents, ideally on each local
machine, then the profiles are sent, along with pointers to the original documents, to a
central machine, where the clustering is done. Documents could be World Wide Web
(WWW) pages, general computer files, directories of files, news group postings, e-mail,
on-line design documents, technical papers or their abstracts, etc.

FROM JONATHAN SMITH 



Summary



A method has been presented for automatically [illegible] by passive monitoring (users do
[illegible] words in the articles [illegible] purchasable items).

A method has also been presented for automatically generating menus to allow users to
locate and retrieve information on topics of interest. This method clusters documents
based on their similarity, as measured by [illegible]
distributed over many machines.



The above methods can be applied to [illegible]
include news articles, reference or work documents, electronic mail, product or service
[illegible] they buy), and electronic bulletin boards (based on the documents posted to them).
Claims:

1) A method for automatically building profiles of people reading documents in order to
retrieve documents of greater interest to the readers (e.g., a custom news clipping
service).

2) A method for automatically labeling clusters of documents in order to allow users to
rapidly locate and retrieve documents on topics of interest to the readers.



2a) A method for [illegible]

3) A method for generating menus from hierarchical clusters consisting of the following
steps:

4) [illegible] customizing menu trees to allow users to more rapidly access
[illegible]



5) A method for locating products or services based on profiles. Profiles are generated for
each product or service based on the word frequencies of words in the product description
and reviews and on other descriptive data. Then (a) products or services which the user
often seeks information about can be "clipped" for the user as in claim X or (b) products
and services can be clustered and compiled into a menu as in claim 2-3.

6) A method for matching [illegible]
[illegible] goods or services. Profiles are generated for each individual
and then individuals are matched based on similarity of their profiles. Again, either (a)
[illegible] often seeks information from can be "clipped" for
the user as in claim 1 or (b) individuals can be clustered and compiled into a menu as in
claim 2-3.

7) A method for matching of sets of people with common interests ("virtual communities") 
based on profiles developed from the messages which the people read and send. 

8) A scaleable method for retrieving documents distributed over large numbers of 
computers. <to be completed by JMS>

SYSTEM FOR CUSTOMIZED INFORMATION DELIVERY

CROSS-REFERENCE TO RELATED APPLICATIONS 

This patent application is related to U.S. Patent Application Serial No. 08/??, filed 
November 28, 1994 and titled "SYSTEM AND METHOD FOR SCHEDULING BROADCAST 
OF AND ACCESS TO VIDEO PROGRAMS AND OTHER DATA USING CUSTOMER 
PROFILES", which application is assigned to the same assignee as the present application. 

FIELD OF THE INVENTION 

This invention relates to customized information retrieval and delivery in an electronic 
media environment and, in particular, to a system that automatically constructs both an 
"article profile" for each article in the electronic media based on the frequency with which each 
word appears in the article relative to its overall frequency of use in all articles, as well as a 
"user profile" for each user based on which news articles a user is most likely to wish to read. 
The system then processes the two profiles to generate a user customized rank ordered 
listing of articles most likely of interest to the user so that the user can select from these 
relevant articles automatically selected by the system from the plethora of articles available 
on the electronic media. 

PROBLEM 

It is a problem in the field of electronic media to enable a user to access information 
of relevance and interest to the user without requiring the user to expend an excessive 
amount of time and energy. Electronic media, such as on line information sources, provide 
a vast amount of information to users, typically in the form of "articles", each of which 
comprises a publication item or document that relates to a specific topic. The difficulty with 
electronic media is that the amount of information available is overwhelming and the articles 
that contain this information are not organized on article repository systems that are 
connected to the electronic media in a manner that simplifies access by a user to only the 
articles of interest to the user. Presently, a user either fails to access relevant articles 
because they are not easily identified or expends a significant amount of time and energy to
conduct an exhaustive search of all articles to identify those most likely of interest to the user.
Furthermore, even if the user conducts an exhaustive search, present information searching 
techniques do not necessarily accurately extract only the most relevant articles, but also 
present articles of marginal relevance due to the functional limitations of the information 
searching techniques. 

Researchers in the field of information retrieval have devoted considerable effort to 
finding efficient and accurate methods of allowing users to select articles of interest from a 
large set of articles. The most widely used methods of information retrieval are based on 
keyword matching: the user specifies a set of keywords which the user thinks will exclusively 
be found in the desired articles and the information retrieval computer retrieves all articles 
which contain those keywords. Such methods are fast, but are notoriously unreliable, as 
users may not think of the right keywords or the keywords may be used in articles in an 
irrelevant or unexpected context. As a result, the information retrieval computers retrieve 
many articles which are unwanted by the user. The logical combination of keywords and the 
use of wild card search parameters help improve the accuracy of keyword searching but do 
not completely solve the problem of inaccurate search results. 

Starting in the 1960's, an alternate approach to information retrieval was proposed: 
users were presented with an article and asked if it contained the information they wanted, 
or to quantify how close the information contained in the article was to what they wanted. 
Each article was described by a profile which comprised either a list of the words in the article 
or, in more advanced systems, a list of word frequencies in the article. Since a measure of 
similarity between articles is the distance between their profiles, the measured similarity of 
article profiles can be used in article retrieval. For example, a user searching for information 
on a subject can write a short description of the desired information. The information retrieval 
computer generates an article profile for the request and then retrieves articles with profiles 
similar to the profile generated for the request. These requests can then be refined using 
"relevance feedback", where the user rates the articles retrieved as to how close the 
information contained therein is to what is desired. The information retrieval computer then 
uses this relevance feedback information to refine the request profile and the process is 
repeated until the user either finds enough articles or tires of the search. 



A number of researchers have looked at methods for selecting articles of most interest 
to users. An article titled "Social Information filtering: algorithms for automating 'word of
mouth'" was published in the CHI-95 Proceedings by Pattie Maes et al. and describes the Ringo
information retrieval system which recommends musical selections. The Ringo system 
requires active feedback from the users - users must manually specify how much they like or 
dislike each musical selection. The Ringo system maintains a complete list of users' ratings
of music selections and makes recommendations by finding which selections were liked by 
multiple people. However, the Ringo system does not take advantage of any available 
descriptions of the music, such as structured descriptions in a data base, or free text, such 
as that contained in music reviews. An article titled "Evolving agents for personalized 
information filtering", published in the Proc. 9th IEEE Conf. on AI for Applications by Sheth
and Maes, described the use of agents for information filtering which use genetic algorithms 
to learn to categorize USENET news articles. In this system, users must define news 
categories and the users actively indicate their opinion of the selected articles. Their system 
uses a list of keywords to represent sets of articles and the user profiles are updated using 
genetic algorithms. 

A number of other research groups have looked at the automatic generation and 
labeling of clusters of articles for the purpose of browsing through the articles. A group at 
Xerox PARC published a paper titled "Scatter/gather: a cluster-based approach to browsing
large article collections" at the 15th Ann. Int'l SIGIR '92, ACM 318-329 (Cutting et al. 1992).
This group developed a method they call "scatter/gather" for performing information retrieval 
searches. In this method, a collection of articles is "scattered" into a small number of clusters, 
the user then chooses one or more of these clusters based on short summaries of the cluster. 
The selected clusters are then "gathered" into a subcollection, and then the process is 
repeated. Each iteration of this process is expected to produce a small, more focussed 
collection. The cluster "summaries" are generated by picking those words which appear most 
frequently in the cluster and the titles of those articles closest to the center of the cluster. 
However, no profiles of users are collected, so no performance improvement occurs over 
time. 



Apple's Advanced Technology Group has developed an interface based on the 
concept of a "pile of articles". This interface is described in an article titled "A 'pile' metaphor
for supporting casual organization of information", in Human Factors in Computer Systems,
CHI '92 Conf. Proc. 627-634, by R. Mander, G. Salomon and Y. Wong, 1992.
Another article titled "Content awareness in a file system interface: implementing the 'pile' 
metaphor for organizing information" was published in 16th Ann. Int'l SIGIR '93, ACM 260-269
by Rose E. D. et al. The Apple interface uses word frequencies to automatically file articles 
by picking the pile most similar to the article being filed. This system functions to cluster 
articles into subpiles, determine keywords for indexing by picking the words with the largest 
TF/IDF (where TF is term (word) frequency and IDF is the inverse article frequency) and label 
piles by using the determined key words. 

Numerous patents address information retrieval methods, but none develop user 
profiles based on passive monitoring of which articles the user accesses. None of the 
systems described in these patents present computer architectures to allow fast retrieval of 
articles distributed across many computers. None of the systems described in these patents 
address issues of using such article retrieval and matching methods for purposes of 
commerce or of matching users with common interests. U.S. Patent No. 5,321,833 issued
to Chang et al. teaches a method in which users choose terms to use in an information 
retrieval query, and specify the relative weightings of the different terms. The Chang system 
then calculates multiple levels of weighting criteria. U.S. Patent No. 5,301,109 issued to 
Landauer et al. teaches a method for retrieving articles in a multiplicity of languages by 
constructing "latent vectors" (SVD or PCA vectors) which represent correlations between the 
different words. U.S. Patent No. 5,331,554 issued to Graham et al. discloses a method for
retrieving segments of a manual by comparing a query with nodes in a decision tree. U.S. 
Patent No. 5,331,556 addresses techniques for deriving morphological part-of-speech 
information and thus to make use of the similarities of different forms of the same word (e.g. 
"article" and "articles"). 

Therefore, there presently is no information retrieval and delivery system operable in
an electronic media environment that enables a user to access information of relevance and
interest to the user without requiring the user to expend an excessive amount of time and
energy.

SOLUTION

The above-described problems are solved and a technical advance achieved in the
field by the system for customized information retrieval and delivery in an electronic media
environment, which system automatically constructs both an "article profile" for each article
in the electronic media based on the frequency with which each word appears in the article
relative to its overall frequency of use in all articles, as well as a "user profile" for each user
based on which news articles a user is most likely to wish to read. The system then
processes the two profiles to generate a user customized rank ordered listing of articles most
likely of interest to the user so that the user can select from these relevant articles
automatically selected from the plethora of articles available on the electronic media.

Because people have multiple interests, multiple profiles are maintained for each user,
corresponding to multiple topics of interest. Each user is presented with those articles whose
profiles most closely match the user's profiles. User profiles are automatically updated on a
continuing basis to reflect each user's changing interests. In addition, articles can be grouped
into clusters and menus automatically generated for each cluster of articles to allow users to
navigate throughout the clusters and manually locate articles of interest.

In the preferred embodiment of the invention, the system for customized delivery of
information uses a fundamental methodology for accurately and efficiently matching users
and objects by calculating, using and automatically updating profiles that describe both the
users' interests and the objects' characteristics. The objects may be published articles,
purchasable items, or even other people. Examples include a person looking for a newspaper
story of potential interest, a movie to watch, an item to buy, or another person to correspond
with. In all cases, the method is based on determining the similarity between a profile for the
object and a profile for the user. Object profiles may include some or all of the following: (1)
structured text (as in a newspaper story, a movie review, a product description or an
advertisement), (2) structured data (the author of the newspaper article, the actor and
directors and rating given to a movie, the price and manufacturer of a product, or the name
and phone number of the person placing an advertisement) and (3) a list of the persons who
have accessed (read or purchased) the object. The structured data may, in particular, contain
information about the quality of the object, such as measures of its popularity (how often it is
accessed) or of user satisfaction (number of complaints received).

The ability to measure the similarity of profiles describing objects and a user's interests 
can be applied in two basic ways: filtering and browsing. Filtering is useful when large 
numbers of objects exist in the electronic media space. These objects can be articles which 
are being received by a user, who only has time to read a small fraction of them. For 
example, one might potentially receive all items on the AP news wire service, all items posted 
to a number of news groups, or all advertisements in a set of newspapers, but few people 
have the time or inclination to read so many articles. A filtering system in the customized 
information delivery system selects a subset of the articles which the user is more likely to 
wish to read. The accuracy of this filtering system improves over time by noting which articles
the user reads and by generating a measurement of the depth to which the user reads the 
article. This information is then used to update the generated user's profile. 

Browsing provides an alternate method of selecting a small subset of a large number 
of objects, such as articles. Articles are organized so that users can navigate among the 
articles by moving from broad groups of articles to smaller, more specific, groups, to individual 
articles, or from articles or article groups to closely related articles or groups. The methods 
used by the customized information delivery system allow articles to be grouped into clusters 
and the clusters to be grouped into larger and larger clusters. These hierarchies of clusters 
then form the basis for menuing and navigational systems to allow the rapid search of large
numbers of articles.

There are a number of variations on the theme of developing and using profiles for article
retrieval, with the basic implementation of an on-line news clipping service representing the
preferred embodiment of the invention. Variations of this basic system are disclosed and
comprise a system to filter electronic mail, an extension for retrieval of objects such as 
purchasable items which may have more complex descriptions, a system to automatically 
build and alter menuing systems for browsing, and a system to construct virtual communities 
of people with common interests. 



BRIEF DESCRIPTION OF THE DRAWING 

Figure 1 illustrates in block diagram form a typical architecture of an electronic media 
system in which the customized information delivery system of the present invention can be 
implemented as part of a user server system; 

Figure 2 illustrates in flow diagram form the operational steps taken by the customized 
information delivery system to screen articles for a user; 

Figure 3 illustrates in block diagram form additional details of the information server 
module of the customized information delivery system; 

Figure 4 illustrates in flow diagram form a method for automatically generating article 
profiles and an associated hierarchical menu system; 

Figures 5-9 illustrate examples of the menu generating process;

Figure 10 illustrates 

DETAILED DESCRIPTION 

In the preferred embodiment of the invention, the system for customized delivery of 
information uses a fundamental methodology for accurately and efficiently matching users 
and objects by calculating, using and automatically updating profiles that describe both the 
users' interests and the objects' characteristics. The objects may be published articles, 
purchasable items, or even other people. Examples include a person looking for a 
newspaper story of potential interest, a movie to watch, an item to buy, or another person to 
correspond with. In all cases, the method is based on determining the similarity between a 
profile for the object and a profile for the user. Object profiles may include some or all of the 
following: (1) structured text (as in a newspaper story, a movie review, a product description
or an advertisement), (2) structured data (the author of the newspaper article, the actor and
directors and rating given to a movie, the price and manufacturer of a product, or the name
and phone number of the person placing an advertisement) and (3) a list of the persons who 
have accessed (read or purchased) the object. The structured data may, in particular, contain 
information about the quality of the object, such as measures of its popularity (how often it is 
accessed) or of user satisfaction (number of complaints received). 



The preferred embodiment of the system for customized information retrieval and 
delivery operates in an electronic media environment for accessing articles, which may be 
news, electronic mail, other published documents, or product descriptions. The system in its 
broadest construction comprises three modules, which may be separate entities, or combined 
into a lesser subset of physical entities. The specific embodiment of this system illustrates 
the use of a first module which automatically constructs an "article profile" for each article in 
the electronic media based on the frequency with which each word appears in the article 
relative to its overall frequency of use in all articles. A second module constructs a "user 
profile" for each user, which user profile is based on the news articles a user is most likely to 
wish to read. The system further includes a profile processing module which processes the 
two profiles to generate a user customized rank ordered listing of articles most likely of 
interest to the user so that the user can select from these relevant articles automatically 
selected from the plethora of articles available on the electronic media. Because people have 
multiple interests, multiple profiles can be maintained for each user, corresponding to multiple 
topics of interest. Each user is presented with those articles whose profiles most closely 
match the user's profiles. User profiles are automatically updated on a continuing basis to 
reflect each user's changing interests. In addition, articles can be grouped into clusters and 
menus automatically generated for each cluster of articles to allow users to navigate 
throughout the clusters and manually locate articles of interest. 
Electronic Media System Architecture 

Figure 1 illustrates in block diagram form the overall architecture of an electronic media 
system in which the system of the present invention can be used to provide user customized 
access to objects that are available via the electronic media system. In particular, the 
electronic media system comprises a data communication facility that interconnects a plurality 
of users with a number of information servers. The users are typically individuals, whose 
personal computers are connected via modem to a telecommunication network. User 
information access software is resident on the user's personal computer and serves to 
communicate via the modem and a telephone connection established in well known fashion 
over the telecommunication network with a network vendor who provides data interconnection 
service with selected ones of the information servers. The user can, by use of the user
information access software, interact with the information servers to request and obtain
access to data that resides on mass storage systems that are part of the information server 
apparatus. New data is input to this system by users via their personal computers and by 
commercial information services by populating their mass storage systems with commercial 
data. Each user and the information servers have electronic mail addresses which enable 
data communication connections to be established between a particular user and the selected 
information server. A user's e-mail address uniquely identifies the user and the user's information
access software in an industry standard format such as: username@aol.com or
username@netcom.com. The network vendors provide telephone access numbers for their 
subscribers, through which the users can access the information servers. The subscribers 
pay the network vendors for the access services on a fee schedule that typically includes a 
monthly subscription fee and usage based charges. A difficulty with this system is that there 
are a large number of information servers located around the world, each of which provides
access to a set of information of differing format, content and topics and via a cataloging 
system that is typically unique to the particular information server. The information is 
comprised of individual "files" (herein termed "articles"), which can be audio data, video data, 
graphics data, text data and combinations thereof. The text data can be of the class: 
commercially provided news articles, published documents, letters, user generated 
documents, database data or combinations of these classes of data. The organization of the 
articles containing the information and the native format of the data contained in articles of 
identical class may vary by information server. Thus, a user can have a difficult task to locate 
articles that contain the desired information, because the information may be contained in 
articles, whose information server cataloging may not enable the user to locate the article. 
Furthermore, there is no standard catalog that defines the presence and services provided 
by all information servers. A user therefore does not have simple access to information but 
must expend a significant amount of time and energy to excerpt a segment of the information 
that may be relevant to the user from the plethora of information that is generated and 
populated on this system. Even if the user commits the necessary resources to this task, 
existing information retrieval processes lack the accuracy and efficiency to ensure that the 
user obtains the desired information. 



It is obvious that within the constructs of this electronic media system, the three 
modules of the customized information delivery system can be implemented in a distributed 
manner, even with various modules being implemented on and/or by different vendors within 
the electronic media system. For example, the information servers can include the article 
profile generation module while the network vendor may implement the user profile generation 
modules and/or the profile processing module. Various other partitions of the modules and 
their functions are possible and the example provided represents an illustrative example and 
is not intended to limit the scope of the claimed invention. 

News Clipping Service

The customized information delivery system of the present invention can be used in 
the electronic media system of Figure 1 to implement an automatic news clipping service 
which learns to select (filter) news articles to match a user's interests, based solely on which 
articles the user chooses to read. The customized information delivery system generates a 
profile for each article that enters the electronic media system based on the relative frequency 
of occurrence of the words contained in the article. The customized information delivery 
system also generates a profile for each user, as a function of a number of user specific 
characteristics, which makes each user's profile unique to that user. As new articles are 
received for storage on the mass storage systems of the information servers, the customized 
information delivery system generates their profiles, which are compared to the user's profiles, 
and articles which are closest (most similar) to the user's profile are presented to the user for 
possible reading. The computer program providing the articles to the user monitors how 
much the user reads (the number of screens of data and the number of minutes spent 
reading), and adjusts the user's profiles to more closely match what was read. The details 
of the method used by this system are disclosed in flow diagram form in Figure 2. This 
method requires selecting a specific method of calculating profiles, of measuring similarity of
articles and profiles, and of updating profiles based on what the user read, and the examples 
disclosed herein are examples of the many possible implementations that can be used and 
should not be construed to limit the scope of the system. 
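
One of the many possible implementations of the compare-and-update cycle can be sketched as follows. The sketch is illustrative only and rests on assumptions not stated in the text: profiles are {term: weight} dictionaries, similarity is a plain dot product, and a user profile is moved toward an article's profile by a step proportional to the fraction of the article that was read. The names dot, rank_articles and update_profile and the rate parameter are hypothetical.

```python
def dot(p, q):
    """Dot product of two {term: weight} profiles."""
    return sum(weight * q.get(term, 0.0) for term, weight in p.items())


def rank_articles(user_profile, article_profiles):
    """Return article ids ordered from most to least similar to the user profile."""
    return sorted(
        article_profiles,
        key=lambda article_id: dot(user_profile, article_profiles[article_id]),
        reverse=True,
    )


def update_profile(user_profile, article_profile, fraction_read, rate=0.5):
    """Move the user profile toward an article in proportion to how much was read.

    fraction_read is between 0.0 (not opened) and 1.0 (read to the end).
    """
    step = rate * fraction_read
    terms = set(user_profile) | set(article_profile)
    return {
        term: (1 - step) * user_profile.get(term, 0.0)
        + step * article_profile.get(term, 0.0)
        for term in terms
    }


# Hypothetical user and article profiles.
user = {"election": 1.0}
articles = {
    "a1": {"election": 0.8, "senate": 0.6},
    "a2": {"baseball": 1.0},
}
print(rank_articles(user, articles))
user = update_profile(user, articles["a1"], fraction_read=1.0)
```

The number of screens viewed or the minutes spent reading could be normalized into fraction_read; how that normalization is done is left open here.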
Retrieve New Articles From Article Source

Articles are available on-line from a wide variety of sources. In the preferred 
embodiment, one would use the current day's news as supplied by a news source, such as
the AP or Reuters news wire. At step 201 on Figure 2, these news articles are input to the 
electronic media system by being loaded into the mass storage system of an information 
server. The article profile module of the customized information delivery system can reside 
on the information server and, as each article is received by the information server, the article 
profile module at step 202 generates a profile for the article and stores the profile in an article 
indexing memory for later use in selectively delivering articles to users. This method is 
equally useful for selecting which articles to read from electronic news groups and electronic 
bulletin boards, and can be used as part of a system for screening and organizing electronic 
mail ('email'). 

Calculate Article Profiles 

An article profile is computed for each article to indicate the relative frequency of word 
occurrences in this article relative to other comparable articles. Many measures can be used, 
but the preferred profile is calculated using the TF/IDF measure: the term (word) frequency 
(TF) times the inverse article frequency (IDF), where the word frequency is the number of 
occurrences of a word in an article divided by the total number of words in the article and the
inverse article frequency is taken to be the logarithm of one over the fraction of articles in
which the word occurs. Words which occur frequently ("a", "and", "the" ...) can influence 
calculated article similarities without indicating article topics, and so are typically removed 
from feature vectors. Other methods of calculating relative word frequencies can also be 
used, such as latent semantic indexing or probabilistic models. For many applications, it is 
useful to augment the set of words actually found in the article with a set of synonyms or other 
words which tend to co-occur with each word in the article. This provides profiles which
match a broader class of articles and reduces the chance of missing desired articles. 
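
As an illustration only, the TF/IDF profile calculation might be sketched as follows. The function names, the tokenizer, and the short stop-word list are hypothetical, and the sketch assumes the conventional form of the inverse article frequency (the logarithm of one over the fraction of articles containing the word), with word counts taken after stop words are removed.

```python
import math
from collections import Counter

# Illustrative stop-word list (an assumption, not taken from the text).
STOP_WORDS = {"a", "and", "the", "of", "to", "in"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip('.,;:!?"()').lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def tfidf_profiles(articles):
    """Return one {word: TF * IDF} profile per article."""
    token_lists = [tokenize(a) for a in articles]
    n_articles = len(token_lists)
    # Number of articles in which each word occurs at least once.
    doc_freq = Counter(w for tokens in token_lists for w in set(tokens))
    profiles = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        profile = {}
        for word, count in counts.items():
            tf = count / total                           # term frequency
            idf = math.log(n_articles / doc_freq[word])  # log(1 / fraction)
            profile[word] = tf * idf
        profiles.append(profile)
    return profiles
```

A word that occurs in every article receives a weight of zero under this form, which is consistent with the observation that very common words say little about an article's topic.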

Synonym dictionaries which group together similar words can be used to improve the 
performance of the profiling and menuing systems. Words which are similar can be treated 
as if they were the same word, or articles containing a word can be augmented with
synonyms of the word or with other words that tend to occur with the word. Thus, an article
talking about "staples" could also be matched to articles about "staplers". The stapler 
example also illustrates the potential utility of morphological analysis where "staple", "stapled" 
and "staples" would be treated as the same word. 

Compare Article Profiles To User Profiles 

A set of profiles is stored for each user. These can be initially selected by any of a 
number of procedures. Among the preferred methods are: using the profiles of articles that 
the user indicates are representative of his interest, asking the user for key words, or using 
a standard set of profiles for the people in a given demographic mix. The user profile 
generation module can reside in any of a number of locations in the electronic media system. 
For example, the user profile can reside on the user's own personal computer and be 
periodically accessed, such as off hours, by the network vendor as part of a news delivery 
service. The user profile can reside on the network vendor's apparatus and again be used 
as part of a news delivery service. Another possibility is that the user's profile is migrated out 
to a number of selected information servers, which perform all the processing of the 
customized information delivery service and simply forward desired articles to the user. For 
the purpose of example, presume that the user profile generation module resides on the 
network vendor system and is used therein to identify articles, newly stored on various
information server systems, that are of interest to the user, whose user profile is presently 
being processed. 

At step 203 of the process, all article profiles of articles newly entered into information 
server systems, broadcast on the network, or otherwise made available to the network vendor 
are compared against the selected user's profile. Those articles whose profile most closely 
correlates to the selected user's profile are selected. The cosine measure of similarity
between article profiles and user profile queries can be used for this correlation processing
step, and is computed as follows:

$$\mathrm{cosine}(\mathrm{Profile}_i, \mathrm{Profile}_j) = \frac{\sum_k \mathrm{term}_{ik}\,\mathrm{term}_{jk}}{\sqrt{\sum_k \mathrm{term}_{ik}^2 \; \sum_k \mathrm{term}_{jk}^2}}$$

where all sums are over k, that is, over all the terms (the TF/IDF values for each word);
$\mathrm{term}_{ik}$ is the TF/IDF for the kth word in profile i, and $\mathrm{term}_{jk}$ is the TF/IDF for the kth
word in profile j. Since most words do not appear in most articles, most terms in most article
profiles are zero, the articles are substantially orthogonal to each other, and obvious measures
of the similarity between articles such as Euclidean distance do not work well. Other similarity
metrics such as the Dice measure can also be used.
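
A minimal sketch of the cosine measure over sparse profiles is given below, assuming each profile is held as a dictionary mapping words to TF/IDF values (a representation chosen here for illustration). The `dice` function shows one common generalization of the Dice measure to weighted vectors; the text does not specify which form is intended.

```python
import math

def cosine(p, q):
    """Cosine of the angle between two sparse {word: weight} profiles."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values()) *
                     sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def dice(p, q):
    """Weighted Dice measure: 2 * dot / (|p|^2 + |q|^2)."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    denom = sum(v * v for v in p.values()) + sum(v * v for v in q.values())
    return 2.0 * dot / denom if denom else 0.0
```

Only words present in both profiles contribute to the numerator, so the many zero terms in a sparse profile cost nothing to process.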

Present List Of Articles To User

Once the user profile/article profile correlation step is completed for a selected user, 
at step 204 the profile processing module presents a list of titles of the selected articles to the 
user, who can then select any article for viewing. (If no titles are available, then the first 
sentences of each article can be used.) The list of article titles is ordered according to the
degree of similarity of the article profile to the user's profile. This list is either transmitted in
real time to the user as the user is present at their personal computer, or can be transmitted
to a user's mailbox, resident on the user's personal computer or within the network server.
The user can then elect which, if any, of the identified articles the user wishes to review. The
user can still access all articles in any information server to which the user has authorized
access; however, those lower on the list are simply further from the user's interests, as defined
by the generated user's profile.
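
The ordering of titles at step 204 might be sketched as follows. The article record layout (`title`, `body`, `profile` keys) and the function names are hypothetical; the similarity function is passed in so that the cosine measure or any other metric can be used.

```python
def display_title(article):
    """Use the title if one is available, else the first sentence of the body."""
    if article.get("title"):
        return article["title"]
    return article["body"].split(".")[0].strip()

def ranked_titles(user_profile, articles, similarity, limit=None):
    """Return titles ordered from most to least similar to the user profile."""
    scored = [(similarity(user_profile, a["profile"]), display_title(a))
              for a in articles]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    titles = [title for _, title in scored]
    return titles if limit is None else titles[:limit]
```

Articles beyond `limit` are not discarded by this sketch; they simply fall lower in the list, matching the statement that the user can still access every article.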

Monitor Which Articles Are Read 

The customized information retrieval system at step 205 monitors which articles the 
user reads, keeping track of how many pages of text are transmitted to the user (read), how
much time is spent viewing the article, and whether all pages of the article were viewed. This
information can be combined to measure the depth of the user's interest in the article. 
Although the exact details depend on the length and nature of the articles being searched, 
a typical formula might be: 



measure of article attractiveness =

    0.2 if the second page is accessed +
    0.2 if all pages are accessed +
    0.2 if more than 30 seconds was spent on the article +
    0.2 if more than one minute was spent on the article +
    0.2 if the minutes spent in the article are greater than half
        the number of pages

The computed measure of article attractiveness can then be used as a weighting function to 
adjust the user's profile to thereby more accurately reflect the user's dynamically changing 
interests. 
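
The typical formula above can be written out directly. This is a sketch only: the parameter names are hypothetical, and it assumes pages are viewed in order, so that viewing at least two pages means the second page was accessed.

```python
def attractiveness(pages_viewed, total_pages, seconds_spent):
    """Score an article from 0.0 to 1.0 using the five 0.2-weighted criteria."""
    criteria = [
        pages_viewed >= 2,                         # second page accessed
        pages_viewed >= total_pages,               # all pages accessed
        seconds_spent > 30,                        # more than 30 seconds
        seconds_spent > 60,                        # more than one minute
        seconds_spent / 60.0 > total_pages / 2.0,  # minutes > half the pages
    ]
    return round(0.2 * sum(criteria), 1)
```

For example, a four-page article read in full over 200 seconds meets all five criteria, while the same article abandoned on the first page after 10 seconds meets none.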

Update User Profiles 

Updating of user profiles can be done at step 206 using the method described in 
copending U.S. Patent Application Serial No. 08/ ???. The user profile closest to the article
profile of the article read is shifted slightly towards the article profile. Given a set of J articles
available with characteristics $d_{jk}$ (assumed correct for now), and a set of user preferences,
$u_{ik}$, viewer i would be predicted to pick a set of P articles to minimize:

$$\sum_{j \,\in\, \text{best } P \text{ of } J} d(d_{jk}, u_{ik})$$

The article characteristics $d_{jk}$ would be some form of word frequencies such as TF/IDF, the
user preferences, $u_{ik}$, are user profiles, and $d(d_{jk}, u_{ik})$ is the distance between them using the
cosine measure. If the user picks a different set of P articles than was predicted, the user
profile generation module should try to adjust d and u to more accurately predict the articles
the user selected. In particular, d and u should be shifted to reduce the match on articles that
were predicted to be selected but were not selected and also to increase the match on articles
that were predicted not to be selected but were selected. A preferred method is to shift u for
each wrong prediction for user i and article j using the formula:

$$u_{ik} = u_{ik} - e\,(u_{ik} - d_{jk})$$

This adjustment increases the match between a generated user's profile and the article profile
of desired articles by making u closer to d if e is positive - for the case where the algorithm
failed to predict an article that the viewer read. The size of e determines how many example
articles one must see to replace what was originally believed. If e is too large, the algorithm
becomes unstable, but for sufficiently small e, it drives u to its correct value. One could in
theory also make use of the fact that the above algorithm decreases the match if e is negative
- for the case where the algorithm predicted an article that the user did not read. However,
there is no guarantee that u will move in the correct direction in that case. If, as is typically
the case, there are multiple user profiles which may be applicable, then only the most
applicable profile - that with the highest agreement with the article selected - is updated.
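
A sketch of this update, under the assumption that profiles are sparse dictionaries and that a missing word has weight zero, might look like the following. The function names are hypothetical, and the similarity function is supplied by the caller.

```python
def shift_profile(u, d, e):
    """Return u - e * (u - d), computed over every word in either profile."""
    words = set(u) | set(d)
    return {w: u.get(w, 0.0) - e * (u.get(w, 0.0) - d.get(w, 0.0))
            for w in words}

def update_closest(user_profiles, article_profile, e, similarity):
    """Shift only the user profile that agrees best with the article read."""
    best = max(range(len(user_profiles)),
               key=lambda i: similarity(user_profiles[i], article_profile))
    updated = list(user_profiles)
    updated[best] = shift_profile(user_profiles[best], article_profile, e)
    return updated
```

With e = 0.5 the profile moves halfway to the article; much smaller values would normally be used so that a single article does not replace what was previously learned.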

One can also shift the term weights w using a similar algorithm: 
$$w_{ik} = \frac{w_{ik} - e\,|u_{ik} - d_{jk}|}{\sum_k \left( w_{ik} - e\,|u_{ik} - d_{jk}| \right)}$$

This is particularly important if one is combining word frequencies with other, structured
data. As before, this increases the match if e is positive - for the case where the algorithm
failed to predict an article that the user read, this time by decreasing the weights on those
characteristics for which the user profile differs from the profile of the article. Again, the size
of e determines how many example articles one must see to replace what was originally
believed. Unlike the case for u, one can also make use of the fact that the above algorithm
decreases the match if e is negative - for the case where the algorithm predicted an article
that the user did not read. The denominator of the expression assures that the modified
weights w still sum to one. Both u and w can be adjusted for each article accessed. When
e is small, as it should be, there is no conflict between the two parts of the algorithm.
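
The weight adjustment might be sketched as below, assuming dense, equal-length lists for w, u, and d, and assuming the denominator sums the adjusted weights over all terms k so that the result sums to one. The function name and the error handling are illustrative choices.

```python
def shift_weights(w, u, d, e):
    """Lower the weight of terms where u and d differ, then renormalize."""
    raw = [wk - e * abs(uk - dk) for wk, uk, dk in zip(w, u, d)]
    total = sum(raw)
    if total <= 0:
        raise ValueError("e is too large: adjusted weights do not sum to a positive value")
    return [r / total for r in raw]
```

A negative e raises the weight of the differing terms instead, which corresponds to the case where a predicted article was not read.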






This description speaks entirely of counting words in an article, but the methods 
proposed can be trivially extended to counting n-grams of letters. (N-grams are just 
sequences of N contiguous letters in an article. For example, this sentence contains a
sequence of 5-grams which starts "For e", "or ex", "r exa", " exam", "examp", etc.) Articles can
be clustered based on the similarity of the occurrence of sequences of five letters (or other 
letter n-grams) in the articles. Clusters are displayed on a two dimensional plot with distances 
between the articles indicating how dissimilar they are. 
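
Extracting letter n-grams is straightforward, as the following sketch suggests. The function names are hypothetical, and the relative-frequency profile is one simple choice; a TF/IDF weighting over n-grams could be substituted.

```python
from collections import Counter

def letter_ngrams(text, n=5):
    """Return every run of n contiguous characters, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_profile(text, n=5):
    """Relative frequency of each n-gram in the text."""
    grams = letter_ngrams(text, n)
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()} if total else {}
```

Spaces are treated as ordinary characters here, so an n-gram may span a word boundary.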

In any case, the selected user's profile is updated at step 206 and the process returns 
to step 203 for the next user in the catalog of users served by the network vendor, in this 
specific example. Alternatively, when a new article enters the electronic media system, its 
profile is generated in a timely manner and this profile is inserted into the pool of profiles to 
identify users who have expressed an interest in this article. There is obviously the need for 
multiprocessing capability in this system and the exact temporal ordering of the various 
operations can be determined based on the computing and communications facilities 
available. Furthermore, the users can pay higher service rates to receive "instant news" so 
that articles that enter the system are delivered on a rush basis to these users, while other 
users have their information retrieval requests processed in due course. 

Matching Users For Purchasing And Virtual Communities 

The user profiles described above can also form the basis for matching buyers and 
sellers, people bartering goods, and people with common interests. These matches may be 
either pairs of people or corporate entities, as in a buyer and a seller, or they may be larger 
groups, as in virtual communities of people wishing to discuss subjects of common interest. 

It is common for merchandisers to provide catalogues with descriptions of the products 
which they sell. These are now often available on-line on CD-ROMs or the Internet. The
profile-based news clipping and information retrieval techniques described above can also 
be used to help users locate items which they may wish to purchase. Profiles of the product 
descriptions and of product reviews can be made, treating them as standard articles 
augmented with structured database information. Users repeatedly shopping for given items
can have profiles created and modified to reflect the item descriptions which they find most
interesting. For single purchases, menus generated by clustering items can be used. This
is particularly important when a user may be shopping from items sold by a large number of 
vendors, where vendors' descriptions and product categories are variable. 

Such profiles can also be used to match any two people based either on profiles
generated from descriptions they provide ("want ads" to buy or to sell) or based on profiles
of what they have purchased or sold in the past. Such matching is particularly critical in "thin"
markets such as most barters, and in certain contract employment situations, where those
involved in the barter or contracting are looking for specific items or abilities. In a market
where many different people are looking for unusual products or services, the ability to
automatically search through millions of product and service descriptions to find matches is
particularly valuable. As described above, the filtering (profile updating) is useful when one
is repeatedly seeking similar people (an ongoing need for database consultants with specific
experience), while the browsing (cluster-based menu) system is more useful for unusual
searches (one is looking for a consultant for a specific database job, based on a description
of that job).

If the descriptions of the purchasable items are free text, then the methods described 
for the news clipping service can be used directly. However, purchasables often have 
structured descriptions, including attributes such as price, availability of colors or sizes, 
delivery times and costs, popularity (volume sold), ratings by users or independent
organizations, etc. Such items can be included along with relative word frequencies (TF/IDF
measures) in the article profiles, as long as care is taken to accurately compute the relative
weightings to be given to the different attributes. 

Virtual Community 

Computer users frequently join other users for discussions on computer bulletin boards
or news groups. In current practice, each bulletin board has a specified topic and users then
look through a long list of topics (typically hundreds) to find topics of interest. The users then
must select for themselves which of thousands of messages they find interesting from among
those posted to the selected bulletin boards and, if desired, post additional discussion on the
topics. The customized information delivery system described above can also be used to
allow users to find other people with similar interests and to generate virtual communities of
people with similar interests. Profiles of each person can be formed by the news-clipping 
method described above: a person's set of profiles is basically the centroids of the clusters 
of the profiles of the bulletin board articles read by the user. Users can then be grouped 
together based on the similarity of one or more of their profiles. Such groups of people have 
been reading and writing about similar topics and in similar styles and so presumably share 
interests. New bulletin boards can be formed as sufficient numbers of users with different 
interests accumulate. 

The existence of thousands of Internet bulletin boards (also termed news groups) and 
countless more private bulletin board services (BBS's) demonstrates the very strong interest 
among members of the electronic community in forums for the discussion of ideas about 
almost any subject imaginable. Presently, bulletin board creation proceeds in a haphazard 
form, usually instigated by a single individual who decides that a topic is worthy of discussion. 
There are protocols on the Internet for voting to determine whether a news group should be 
created, but there is a large hierarchy of news groups (all beginning with the prefix "alt.") that 
do not follow this protocol. 

The customized information delivery system can function as a browser for bulletin 
boards where each bulletin board is characterized by one or more profiles, so that potential 
new bulletin board users can locate bulletin boards by presenting the customized information 
delivery system with one or more messages typical of what they are looking for. User profiles 
are generated from the messages using the methods described above, and then the user 
identifies those bulletin boards most closely corresponding to the messages. Note that 
bulletin boards, like all objects with profiles, could also be clustered and put into an 
automatically generated menu. 

The above described method is useful if people with the right set of interests have 
already formed a bulletin board or chat group. However, because people have varied and 
varying complex interests, it is desirable to automatically create groups of people (virtual
communities) with common interests. The Virtual Community Service (VCS) described below
is a network-based agent that seeks out users of a network with common interests,
dynamically creates bulletin boards or electronic mailing lists for those users, and introduces
them to each other electronically via e-mail. 

The functions of Virtual Community Service are general functions that could be 
implemented on any network ranging from an office network in a small company to the World 
Wide Web or the Internet. The four main steps in the procedure are: 

1. Scan postings to bulletin boards and cluster people who post similar
messages.

2. Create new bulletin boards. 

3. Announce new bulletin boards. 

4. Enroll users in the bulletin boards. 

Each of these steps can be carried out as described below. 

Scanning 

Virtual Community Service locates users with interests in common to form a virtual 
community. Using the text-searching and clustering technology described above, the Virtual 
Community Service constantly scans all the news groups and electronic mailing lists on a 
given network. The network can be the Internet, or a set of bulletin boards maintained by 
America On-Line, Prodigy, or Compuserve, or a smaller set of bulletin boards that might be 
local to a single organization, for example a large company, a law firm, or a university. 

Clustering the messages sent to bulletin boards based on the similarity of their profiles 
automatically finds threads of discussion that show common interests among the users. 
Naturally, discussions on a single bulletin board tend to show common interests; however, 
this method uses all the texts from every available bulletin board and electronic mailing list. 
Whenever it finds a cluster of sufficient size (for example, 10-20 different messages) in 
different bulletin boards or mailing lists, it forms a new virtual community. 

Bulletin Board Creation 

Virtual Community Service creates a bulletin board or an electronic mailing list, 
whichever is appropriate, representing the newly-formed community. If the newly-found
cluster only contains a small number of members, for example 2-10 people, it is probably
most appropriate to create a small mailing list and subscribe these people to it. Virtual
Community Service initiates the mailing list by sending an e-mail message containing the text
it found and a suggested name (see below for techniques for generating cluster labels) for the
new mailing list. After this initial message, users may choose to respond and continue the
discussion, or they may let it expire if they prefer. All such mailing lists will be time-stamped,
and if they are not used for a user-determined length of time, Virtual Community Service 
deletes them. 

If Virtual Community Service finds a larger number of people engaged in a discussion 
in different forums, it creates a new bulletin board instead. Virtual Community Service creates
a name for this bulletin board and posts on it the messages it has collected: all the messages
in a cluster. For a short period of time, Virtual Community Service continues collecting
messages automatically using its clustering routines, and these messages are posted to the
bulletin board. After the time limit has expired, new readers of the bulletin board have to
maintain it by posting items voluntarily. Alternatively, the system manager can allow Virtual
Community Service to continue posting items automatically for an indefinite period of time,
if this is deemed useful. 
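
The choice between a mailing list and a bulletin board, and the expiry of unused mailing lists, might be sketched as follows. The threshold constants are taken from the example ranges in the text (10-20 messages, 2-10 people) and would be tunable in practice; the function names and return values are hypothetical.

```python
from datetime import datetime, timedelta

MIN_CLUSTER_MESSAGES = 10      # lower end of the "10-20 messages" example
MAX_MAILING_LIST_MEMBERS = 10  # upper end of the "2-10 people" example

def choose_forum(n_messages, n_members):
    """Return the kind of forum to create, or None if the cluster is too small."""
    if n_messages < MIN_CLUSTER_MESSAGES or n_members < 2:
        return None
    if n_members <= MAX_MAILING_LIST_MEMBERS:
        return "mailing_list"
    return "bulletin_board"

def is_expired(last_used, now, idle_limit):
    """True if a mailing list has gone unused longer than the allowed time."""
    return now - last_used > idle_limit
```

Here `idle_limit` stands for the user-determined length of time after which an unused mailing list is deleted.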

Announcing 

Virtual Community Service informs all the members of the new virtual community of
the new bulletin board service (or mailing list), and gives each of them a brief summary of the
basis of the community. Because bulletin boards are read voluntarily, Virtual Community
Service sends an e-mail message directly to everyone who posted any text that served as the
basis for a new bulletin board it has created. As stated above, it will do the same for new
mailing lists. However, with bulletin boards, it simply informs users of the existence and name
of the bulletin board, and leaves it to them to decide whether or not they wish to read it.

Enrolling

Even after creation of a new Virtual Community, Virtual Community Service continues 
to scan existing mailing lists and bulletin boards for messages that belong to that community.
Any new messages that appear are cross-posted to the new community, and the person
posting the message is informed by Virtual Community Service that they belong to a new 
Virtual Community. The user can then decide whether or not to read the new Virtual 
Community Service bulletin board or, if the community is using a mailing list, to subscribe to 
the mailing list. 

With these facilities, Virtual Community Service provides automatic creation of virtual 
communities in any local or wide-area network. The core technology underlying Virtual 
Community Service is creating a search and clustering mechanism that can find articles that 
are "similar" in that the users share interests. This is precisely what was described above. 
One must be sure that Virtual Community Service does not bombard users with notices about 
communities in which they have no real interest. On a very small network a human could be 
"in the loop", scanning proposed virtual communities and perhaps even giving them names. 
But on larger networks the Virtual Community Service has to run in fully automatic mode, 
since it is likely to find a large number of virtual communities. 

Matching of one person against a second person presents additional technical 
difficulties, since each person may have multiple profiles. Defining the match between a 
person with multiple profiles and an article with a single profile is easy: the distance between
the person and the article is taken to be the smallest of the distances between the article and
each of the person's profiles. For matching pairs or groups of potential virtual community
members, one can either 1) match each profile of person A with each profile of person B and
take the closest match, in effect determining if A and B have any interests in common, or 2)
calculate the sum of squares of the distances between the n closest profiles. If n=1, this
reduces to (1); otherwise it looks for more areas of overlap. We generally take n to be 2.
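
Both matching rules can be sketched in a few lines. The sketch interprets "the n closest profiles" as the n closest pairs formed from one profile of each person, which is an assumption; the distance function is supplied by the caller (for example, one minus the cosine measure).

```python
def person_article_distance(person_profiles, article_profile, distance):
    """Smallest distance between the article and any of the person's profiles."""
    return min(distance(p, article_profile) for p in person_profiles)

def person_person_distance(profiles_a, profiles_b, distance, n=2):
    """Sum of squared distances over the n closest cross-person profile pairs."""
    pair_distances = sorted(distance(a, b)
                            for a in profiles_a for b in profiles_b)
    return sum(x * x for x in pair_distances[:n])
```

With n=1 the result is the square of the single closest match, which ranks pairs of people in the same order as option (1).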

Menu Generation Based On Clustering and Automatic Cluster Labeling 

A browsing system can be constructed for the retrieval of articles, such as reference 
articles in an on-line library using the same methods underlying the news clipping service. 
The main difference is that since all reference articles are potentially of interest and not just 
the most recent ones, as is the case for news, and since users are often looking for specific
topics, the preferred interface includes a menu system that the user can navigate to locate
articles of interest to the user. 

In order to allow users to rapidly locate an article from among a large set of articles, the
customized information delivery system calculates profiles for each article using the TF/IDF
word frequency measures described above. Articles are then grouped into clusters using a
hierarchical clustering algorithm, such as a hierarchical k-means algorithm. Clusters are groups
of articles which are similar to each other, as determined by some uniform metric, such as the
cosine measure. Hierarchical clusters produce a tree which divides the articles first into two
large clusters of roughly similar articles; then each of these clusters is in turn divided into two
or more smaller clusters, which in turn are each divided into yet smaller clusters until a
"cluster" is found consisting of a single article. This division of articles provides an efficient
method to retrieve an article of interest: the user first chooses between the highest level
(largest) clusters and then selects among the smaller clusters. This process is repeated until
the user comes to the lowest level in the tree, which consists of the articles themselves.
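
One simple way to obtain such a tree is to apply 2-means recursively (often called bisecting k-means). The sketch below uses that simplification, which is swapped in here for illustration and is not necessarily the clustering procedure intended by the text. It assumes dense, equal-length profile vectors, a deterministic seeding rule chosen for reproducibility, and hypothetical function names.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def two_means(items, iterations=10):
    """Split (name, vector) items into two non-empty groups."""
    # Seeds (an assumption): the first item and the item least similar to it.
    c0 = items[0][1]
    c1 = min(items[1:], key=lambda it: cosine(c0, it[1]))[1]
    left, right = items[:1], items[1:]          # fallback split
    for _ in range(iterations):
        new_left = [it for it in items
                    if cosine(it[1], c0) >= cosine(it[1], c1)]
        new_right = [it for it in items
                     if cosine(it[1], c0) < cosine(it[1], c1)]
        if not new_left or not new_right:
            break                               # keep the last valid split
        left, right = new_left, new_right
        c0 = mean([v for _, v in left])
        c1 = mean([v for _, v in right])
    return left, right

def build_tree(items):
    """Return a nested tuple tree whose leaves are article names."""
    if len(items) == 1:
        return items[0][0]
    left, right = two_means(items)
    return (build_tree(left), build_tree(right))
```

Each call splits a cluster in two until single articles remain, producing the binary tree that the user descends when browsing.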

Hierarchical trees allow rapid selection of one article from a large set. In ten menu
selections from menus of ten items each, one can reach 10^10 = 10,000,000,000 (ten
billion) items.

The menu generation system is described in flow diagram form in Figure 4. At step 
401, the article profile module calculates article profiles as described above, using a uniform
system, such as the TF/IDF measure described above. At step 402, the article profile 
generation module orders all the articles stored in the mass storage system into a hierarchical 
cluster. As noted above, articles are clustered into similar groups (with distance defined by 
the cosine measure) using a hierarchical k-means clustering algorithm. At step 403, the 
article profile generation module generates labels for each cluster in the hierarchy of clusters 
that was produced at step 402. Labels are generated for each cluster by finding the center 
(mean) of the TF/IDF's of the articles in the cluster and selecting those words in the mean of 
the cluster which are farthest from the average TF/IDF across all articles. The near duplicate 
words are then removed, with the near duplicate words being words which are morphological 
variants of the same word (e.g. keep only one of "sale" and "sales" or of "sleep" and
"sleeping"). Finally, at step 404, the article profile generation module generates menus from
the hierarchical cluster structure and cluster labels produced at steps 402 and 403. The
hierarchical clustering gives, as d<add>. At step 405, optionally, the system monitors which
articles are read and adjusts the generated menus. Each of these steps is described in more
detail below.

Label Generation

A key requirement to successfully use the menu is the ability to automatically label the 
clusters. Labels can be generated by selecting the article or articles closest to the center of 
the cluster and then displaying either the article title, or the set of words in the profile of the 
centroid of the article cluster which have the highest relative frequency (TF/IDF). More 
informative labels can be generated by removing redundant synonyms from the labels either 
using a synonym dictionary or a morphological analyzer (for example: "salt, salted, salty, 
saltier...." should reduce to "salt") and using terms which have the highest discriminatory 
ability as labels, rather than those with the highest frequency. <explain how this is done> It 
is also useful to allow users to view the articles as needed in case the above descriptive 
labels are insufficient. 
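
One possible reading of the labeling step is sketched below: take the mean profile of the cluster, rank words by how far they lie above the mean profile of all articles, and skip morphological near-duplicates. The signed difference, the crude suffix-stripping rule standing in for a morphological analyzer, and the function names are all assumptions made for illustration.

```python
def mean_profile(profiles):
    """Average a list of sparse {word: weight} profiles."""
    words = set().union(*profiles)
    n = len(profiles)
    return {w: sum(p.get(w, 0.0) for p in profiles) / n for w in words}

def crude_stem(word):
    """Very rough stand-in for a morphological analyzer (an assumption)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def cluster_label(cluster_profiles, all_profiles, n_words=3):
    """Pick the words most characteristic of the cluster, one per stem."""
    center = mean_profile(cluster_profiles)
    overall = mean_profile(all_profiles)
    # Largest excess over the overall average first; ties broken alphabetically.
    ranked = sorted(center,
                    key=lambda w: (-(center[w] - overall.get(w, 0.0)), w))
    label, seen = [], set()
    for word in ranked:
        stem = crude_stem(word)
        if stem in seen:
            continue                  # near duplicate of a word already chosen
        seen.add(stem)
        label.append(word)
        if len(label) == n_words:
            break
    return label
```

A word that is frequent everywhere scores near zero under this ranking, so the label favors terms that discriminate the cluster from the rest of the collection.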

Menu Generation 

Although most clustering methods generate binary branching trees, for human users,
it is preferable to have 4-8 items in a menu. This is easily accomplished by displaying the four
"grandchildren" or eight "great-grandchildren" of a node in the cluster tree (see Figure x).

Menu Updating 

As users access articles using the menuing system, certain articles and certain regions 
of the article profile space are accessed more frequently than others. These regions 
correspond to the user's interests. If, for example, a user frequently accessed articles close
to a, b, and d in Figure 5 then the menu in Figure 8 could be modified to show the structure
illustrated in Figure 9. A more formal algorithm for this is: associate with each node of the
tree a probability that the article the user selects is located under that node, i.e., is in that
cluster. These probabilities can be either all set equal to the number of articles in the cluster
divided by the total number of articles, or, if data are available, can be based on total access
to each article by all users. The probability of access of a cluster is the total number of
accesses of the articles in the cluster divided by the total number of accesses of all articles.
Once weightings are known for each cluster at each level of the tree, user menus can be
generated to maximize the retrieval efficiency. If menus are chosen so that each menu item has
approximately equal probability of accessing the articles under it, then the user has to go
through the minimum number of menus.
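
A sketch of the probability calculation follows, together with one greedy heuristic for building a menu whose items have roughly equal access probability. The heuristic (repeatedly expanding the most probable cluster until the menu is full) is an assumption introduced here, as are the function names and the nested-tuple tree representation with article names as leaves.

```python
def leaves(node):
    """List the article names under a node of a nested-tuple cluster tree."""
    if not isinstance(node, tuple):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

def access_probability(node, access_counts):
    """Accesses of articles under the node divided by accesses of all articles."""
    total = sum(access_counts.values())
    if total == 0:
        raise ValueError("no accesses recorded")
    return sum(access_counts.get(leaf, 0) for leaf in leaves(node)) / total

def balanced_menu(root, access_counts, max_items=4):
    """Expand the most probable cluster until the menu has max_items entries."""
    menu = list(root) if isinstance(root, tuple) else [root]
    while len(menu) < max_items:
        expandable = [m for m in menu if isinstance(m, tuple)]
        if not expandable:
            break
        heaviest = max(expandable,
                       key=lambda m: access_probability(m, access_counts))
        i = menu.index(heaviest)
        menu[i:i + 1] = list(heaviest)   # replace the cluster by its children
    return menu
```

Passing a count of one for every article reproduces the default weighting, in which a cluster's probability is its share of the total number of articles.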

<walk through details> 

Menu Generation From Article Clusters 

Figure 5 illustrates article profiles. Each letter represents an article. Axes $x_1$ and $x_2$ are
two of the hundreds of relative word frequencies. Circles represent clusters. Figure 6
illustrates a hierarchical cluster tree for the articles displayed in Figure 5. Figure 7 illustrates
cluster labels for the hierarchical cluster tree shown in Figure 6. Article titles, most
distinguishing or most frequent words could also be used for cluster labels. Figure 8 
illustrates a collapsed menu tree from the hierarchical cluster tree shown in Figure 7. 
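
Collapsing a binary cluster tree into a menu, as described under Menu Generation, might be sketched as follows. The tree is assumed to be a nested tuple with article names as leaves, and the function name is hypothetical.

```python
def descendants_at_depth(node, depth):
    """Return the subtrees `depth` levels below node.

    Leaves reached before the requested depth are kept as menu items.
    """
    if depth == 0 or not isinstance(node, tuple):
        return [node]
    items = []
    for child in node:
        items.extend(descendants_at_depth(child, depth - 1))
    return items
```

In a strictly binary tree, a depth of 2 yields up to four menu items and a depth of 3 yields up to eight, which falls within the preferred range of 4-8 items per menu.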

Alternate Embodiments of the Menuing System

Many variations and improvements on the above menuing system are possible.
Different clustering methods can be used, such as fuzzy clustering systems which allow a 
given article to appear in more than one cluster. When retrieving articles which contain 
structured information as well as free text, the word frequencies can be supplemented with 
other terms such as the manually assigned category of the article, the price of the object 
being described in the article, or expert ratings of the object being described in the article, 
such as the number of stars given to a movie in a movie review. Similarity measures 
(distance metrics) can be adjusted to reflect different stress laid by different users on different 
terms or items. 

The automatic menuing system described above can be used in parallel with manually
generated classifications of articles when such are available. Users can index into clusters
either by starting at the top of the tree and moving to more specific subclusters or by
giving an English language query which is matched against the clusters. After an initial
cluster is located by finding the cluster center most similar to the query, the user can move
to adjacent clusters. It is generally less efficient to start at the largest cluster and repeatedly
select smaller subclusters than it is to write a brief description of what one is looking for and
then to move to nearby clusters if the objects initially recommended are not those desired.
<explain further?> 
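
A minimal sketch of query-based entry into the clusters, assuming that a query is turned into a raw word-count profile, that similarity is measured by cosine, and that "adjacent" clusters are simply those whose centers are most similar to the chosen cluster's center. The cluster names, words, and weights are hypothetical.

```python
import math
from collections import Counter

def cosine(p, q):
    """Cosine similarity between two sparse profiles (word -> weight)."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def nearest_cluster(query, centers):
    """Return the name of the cluster whose center best matches the query."""
    q = Counter(query.lower().split())
    return max(sorted(centers), key=lambda name: cosine(q, centers[name]))

def adjacent_clusters(name, centers):
    """Other clusters, ordered from most to least similar to the given one."""
    others = [c for c in sorted(centers) if c != name]
    return sorted(others, key=lambda c: -cosine(centers[name], centers[c]))

# Hypothetical cluster centers.
centers = {
    "politics": {"election": 0.8, "senate": 0.5, "budget": 0.3},
    "economy": {"budget": 0.7, "inflation": 0.6, "election": 0.1},
    "sports": {"baseball": 0.9, "inning": 0.4},
}
start = nearest_cluster("senate election results", centers)
print(start, adjacent_clusters(start, centers))
```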

Clustering 

Consider the following hypothesis: each user reads articles from two different clusters. 
The goal is to reconstitute the "original clusters" from the observed list of articles read. This 
can be cast as a clustering problem similar to k-means, but the criterion being
optimized is a little different:

$$\sum_i \sum_C \left( M_C \, (x_i - \bar{x}_C) \right)^2$$

where $C$ is the cluster, $i$ is the article, $x_i$ is the profile of article $i$, $\bar{x}_C$ is the
mean of cluster $C$, and $M_C$ is an
indicator matrix which is zero off the diagonal and is between zero and one on the diagonal,
indicating how much the article is in which cluster. Note: one could also use a cosine
measure.

$$\sum_C M_C = I \quad \text{(the identity)}$$
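
A minimal sketch of evaluating this weighted criterion, under a simplifying assumption that is not stated in the text: each article's degree of membership in a cluster is a single number between zero and one (that is, the indicator matrix for that article and cluster is a scalar multiple of the identity), and the memberships of each article sum to one. The function name `criterion` and the sample points are hypothetical.

```python
def criterion(points, means, membership):
    """Sum over articles and clusters of (weight * distance to cluster mean)^2.

    membership[i][c] is the degree (0..1) to which article i belongs to
    cluster c; each row must sum to 1.
    """
    total = 0.0
    for x, weights in zip(points, membership):
        if abs(sum(weights) - 1.0) > 1e-9:
            raise ValueError("memberships of each article must sum to 1")
        for w, mean in zip(weights, means):
            sq = sum((a - b) ** 2 for a, b in zip(x, mean))
            total += (w ** 2) * sq
    return total

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
means = [(1.0, 0.0), (10.0, 0.0)]

hard = [[1, 0], [1, 0], [0, 1]]          # ordinary k-means assignment
soft = [[1, 0], [0.5, 0.5], [0, 1]]      # second article split between clusters

print(criterion(points, means, hard))    # 2.0
print(criterion(points, means, soft))    # 17.25
```

With hard 0/1 memberships the function reduces to the usual k-means sum of squared distances.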

For k-means, $M_C$ is either $I$ or 0. For the two-cluster hypothesis above, $M_C$ is
0 for all but two clusters, and has a mix of 1's and 0's on the diagonal for the two clusters that
the articles do fall in.

We have discussed two types of clustering:

1) Object-based clustering, in which one (a) clusters users based on the similarity of the
articles they read or (b) clusters articles based on their being read by the same users.

2) Attribute-based clustering, in which one (a) clusters articles based on the similarity of their
attributes (word frequencies) or (b) clusters users based on similar attributes (demographics
and psychographics).
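
To make the distinction concrete, the toy sketch below computes the similarities that would feed 1a, 1b and 2a from the same data. It assumes a simple set-overlap (Jaccard) similarity purely for illustration; the user names, article names, and word sets are hypothetical, and any clustering method could then be run on the resulting similarities.

```python
def jaccard(a, b):
    """Overlap between two sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def readers_of(reads):
    """Invert user -> articles read into article -> users who read it."""
    out = {}
    for user, arts in reads.items():
        for art in arts:
            out.setdefault(art, set()).add(user)
    return out

reads = {"ann": {"A", "B"}, "bob": {"A", "B", "C"}, "cy": {"C", "D"}}
words = {
    "A": {"election", "senate"},
    "B": {"baseball", "pitcher"},
    "C": {"election", "budget"},
    "D": {"baseball", "inning"},
}
readers = readers_of(reads)

print(jaccard(reads["ann"], reads["bob"]))    # 1a: users by articles read
print(jaccard(readers["A"], readers["B"]))    # 1b: articles by shared readers
print(jaccard(words["A"], words["B"]))        # 2a: articles by attributes
```

In this toy data, articles A and B look identical under 1b (the same users read both) yet completely dissimilar under 2a (no words in common), which is the kind of disagreement the combination methods are meant to reconcile.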

One could imagine several different combination methods: 
3) Hybrid method 1: create a combined function to be minimized which includes the
standard costs associated with some combination of 1a, 1b, 2a and 2b. Simultaneously
minimize the distance of articles and users from their cluster centers, as found by both
object and attribute clustering, using standard k-means clustering.

4) Hybrid method 2: do 1b, so that articles are labeled by cluster based on which users
read them, then use supervised clustering (maximum likelihood discriminant methods) on
the word frequencies to do 2a. This uses our knowledge of who read what to do a
better job of clustering based on word frequencies. One could similarly combine 1a and 2b.
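
Hybrid method 2 might be sketched as below. The readership-based labels from step 1b are taken as given, and a nearest-class-mean rule stands in for the maximum likelihood discriminant (the two coincide only under the assumption of equal, spherical Gaussian classes with equal priors, which is an assumption of this sketch and not a statement from the text). The label names and word-frequency vectors are hypothetical.

```python
def class_means(vectors, labels):
    """Mean word-frequency vector of each readership-based cluster label."""
    sums, counts = {}, {}
    for v, lab in zip(vectors, labels):
        acc = sums.setdefault(lab, [0.0] * len(v))
        for j, val in enumerate(v):
            acc[j] += val
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def classify(v, means):
    """Assign a vector to the label whose mean is closest (squared distance)."""
    return min(sorted(means),
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(v, means[lab])))

# Word-frequency vectors of articles already labeled by who read them (step 1b).
vectors = [[0.9, 0.1], [0.8, 0.0], [0.1, 0.9], [0.2, 0.7]]
labels = ["readers_1", "readers_1", "readers_2", "readers_2"]

means = class_means(vectors, labels)
# Articles nobody has read yet are placed using word frequencies alone (step 2a).
print(classify([0.7, 0.2], means), classify([0.0, 0.6], means))
```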

Summary 

A method has been presented for automatically selecting articles of interest to a user. 
The method generates profiles of the users based on the relative frequency of occurrence of 
words in the articles which they read. It is characterized by passive monitoring (users do not 
need to explicitly rate the articles), multiple profiles per user (reflecting interest in multiple 
topics) and use of elements of the profiles which are automatically determined from the data 
(the TF/IDF measure based on word frequencies and descriptions of purchasable items). A 
method has also been presented for automatically generating menus to allow users to locate
and retrieve information on topics of interest. This method clusters articles based on their
similarity, as measured by the relative frequency of word occurrences. Clusters are labeled
either with article titles or with key words extracted from the articles. The method can be
applied to large sets of articles distributed over many machines. 
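
For reference, a profile built from a TF/IDF measure can be sketched as follows. The exact weighting is an assumption here: term frequency is taken as a word's share of the article, and inverse document frequency as the logarithm of (number of articles / number of articles containing the word), which is one common variant among several. The sample texts are hypothetical.

```python
import math
from collections import Counter

def tfidf_profiles(docs):
    """docs: name -> text. Returns name -> {word: TF-IDF weight}."""
    tokens = {name: text.lower().split() for name, text in docs.items()}
    # Document frequency: in how many articles does each word appear?
    df = Counter(w for words in tokens.values() for w in set(words))
    n = len(docs)
    profiles = {}
    for name, words in tokens.items():
        tf = Counter(words)
        profiles[name] = {
            w: (count / len(words)) * math.log(n / df[w])
            for w, count in tf.items()
        }
    return profiles

docs = {
    "a": "the senate vote",
    "b": "the baseball game",
    "c": "the senate budget",
}
profiles = tfidf_profiles(docs)
print(profiles["a"])
```

A word appearing in every article (here "the") receives weight zero, while a word unique to one article (here "vote") outweighs one shared with other articles ("senate").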

The above methods can be applied to anything for which profiles can be generated.
These include news articles, reference or work articles, electronic mail, product or service
descriptions, people (based on the articles they read or the descriptions of the products they 
buy), and electronic bulletin boards (based on the articles posted to them). A particular 
consequence of being able to group people by their interests is that one can form virtual 
communities of people of common interest, who can then correspond with one another via 
electronic mail. 
WE CLAIM:

1. A method for automatically providing a user with access to selected ones of a
plurality of articles that are stored on an electronic storage media, where said users are 
connected by user terminals via data communication connections to an information server 
system which includes said electronic storage media, said method comprising the steps of: 

automatically generating article profiles for articles stored in said electronic 
storage media, each of said article profiles being generated from the contents of an 
associated one of said articles; and 

enabling access to said plurality of articles stored on said electronic storage 
media by users via said article profiles. 

2. The method of claim 1 wherein said step of enabling access comprises: 
correlating a user profile, generated for an identified user, with said generated
article profiles to identify ones of said plurality of articles stored on said electronic storage
media that are likely to be of interest to said identified user.

3. The method of claim 2 wherein said step of enabling access further comprises: 
transmitting a list, that identifies at least one of said identified ones of said
plurality of articles, to said identified user; and

providing access to a selected one of said plurality of articles stored on said 
electronic storage media in response to said identified user selecting an item from said list. 

4. The method of claim 3 wherein said step of providing access comprises: 
transmitting data, in response to said identified user activating a one of said user
terminals to identify said selected item on said list, indicative of said identified user's selection
of said selected item from said one user terminal to said information server via a one of said
data communication connections.

5. The method of claim 4 wherein said step of providing access further comprises: 
retrieving, in response to receipt of said data from said one user terminal, an
article identified by said selected item from said electronic storage media; and 

transmitting said retrieved article to said one user terminal for display thereon 
to said identified user. 

6. The method of claim 1 wherein said step of enabling access comprises: 
automatically generating a user profile for an identified user that is indicative of
both article profiles of articles retrieved by said identified user as well as the number of pages
of said retrieved articles accessed by said identified user.

7. The method of claim 6 wherein said automatically generated user profile is also 
indicative of a length of time said identified user accessed said retrieved articles. 

8. The method of claim 1 wherein said step of automatically generating article 
profiles comprises: 

automatically generating a hierarchical menu that directs said users to at least 
a subset of said plurality of articles stored on said electronic media, comprising: 

sorting all articles in said subset into a plurality of clusters of articles 
based on an empirical measure of similarity of content of said articles, and 

generating a hierarchical menu that identifies the content in common of 
articles sorted into each of said plurality of clusters, to enable said identified 
user to identify ones of said plurality of articles stored on said electronic storage 
media that are likely to be of interest to said identified user. 

9. The method of claim 8 wherein said step of automatically generating a 
hierarchical menu further comprises: 

ascribing a label to each of said plurality of clusters. 

10. The method of claim 8 wherein said step of sorting comprises: 
dividing said plurality of articles into at least two clusters based upon said
empirical measure of similarity of content of said articles, 

subdividing each of said at least two clusters into at least two subclusters based 
upon said empirical measure of similarity of content of said articles, and 

repeating said step of subdividing to produce a multi-level hierarchy of identified
clusters.

11. The method of claim 10 wherein said step of generating a hierarchical menu
comprises: 

ascribing a label to each cluster produced by all steps of dividing and 
subdividing in said step of sorting. 

12. The method of claim 11 wherein said step of ascribing comprises:
identifying at least one term in said generated article profiles produced for ones
of said plurality of articles sorted into a cluster that is indicative of the information content of
said ones of said plurality of articles sorted into said cluster.

13. The method of claim 11 wherein said step of ascribing comprises:
selecting at least one article of said ones of said plurality of articles sorted into
said cluster that are closest to the center of the cluster, and

ascribing a label that is indicative of the information content of said ones of said 
plurality of articles sorted into said cluster, said label comprising elements of at least one of: 
a title of said selected at least one article, and a set of words contained in the article profile 
of said selected at least one article cluster which have the highest relative frequency. 

Additional Claims: 

1) A method for automatically building profiles of people reading articles in order to retrieve
articles of greater interest to the readers (e.g., a custom news clipping service).
2a) A method for screening email based on comparison of user profiles with those of the mail
messages. 

4) A method for automatically customizing menu trees to allow users to more rapidly access 
articles in repeatedly accessed areas of interest. 

5) A method for locating products or services based on profiles. Profiles are generated for 
each product or service based on the word frequencies of words in the product description 
and reviews and on other descriptive data. Then (a) products or services which the user often 
seeks information about can be selected for the user as in claim 1 or (b) products and 
services can be clustered and compiled into a menu as in claims 2-4. 

6) A method for matching up different people with common interests for purposes such as 
buying, selling or bartering goods or services. Profiles are generated for each individual and 
then individuals are matched based on similarity of their profiles. Again, either (a) individuals
similar to those from whom the user often seeks information can be "clipped" for the user as in
claim 1 or (b) individuals can be clustered and compiled into a menu as in claims 2-3.

7) A method for matching of sets of people with common interests ("virtual communities") 
based on profiles developed from the messages which the people read and send. 

8) A scalable method for retrieving articles distributed over large numbers of computers. <to
be completed by JAMS> 

9) A method for combining information on what articles each user has retrieved with 
demographic or other descriptions of the users and with attributes of the article such as word 
frequencies in order to more accurately cluster articles into groups. These groups can then 
be used both for article retrieval and for article filtering. 
SYSTEM FOR CUSTOMIZED INFORMATION DELIVERY

ABSTRACT 



A system for customized information retrieval and delivery in an electronic media
environment automatically constructs both an "article profile" for each article
in the electronic media, based on the frequency with which each word appears in the article
relative to its overall frequency of use in all articles, and a "user profile" for each user,
based on which news articles the user is most likely to wish to read. The system then
processes the two profiles to generate a user-customized, rank-ordered listing of the articles most
likely to be of interest to the user, so that the user can select from these relevant articles,
automatically selected from the plethora of articles available on the electronic media.
Because people have multiple interests, multiple profiles are maintained for each user, 
corresponding to multiple topics of interest. Each user is presented with those articles whose 
profiles most closely match the user's profiles. User profiles are automatically updated on a 
continuing basis to reflect each user's changing interests. Alternatively, articles are grouped 
into clusters and menus are automatically generated for each cluster of articles to allow users 
to navigate throughout the clusters and manually locate articles of interest. 
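
The rank-ordered listing with multiple profiles per user could be sketched along these lines. It assumes, for illustration only, that an article's score is its best match against any one of the user's profiles and that the match is the dot product of length-normalized profiles; the profile contents and article names are hypothetical.

```python
import math

def unit(profile):
    """Scale a word -> weight profile to unit length."""
    norm = math.sqrt(sum(v * v for v in profile.values()))
    return {w: v / norm for w, v in profile.items()} if norm else {}

def match(user_profile, article_profile):
    """Dot product of the two length-normalized profiles."""
    u, a = unit(user_profile), unit(article_profile)
    return sum(v * a.get(w, 0.0) for w, v in u.items())

def rank_articles(user_profiles, article_profiles):
    """Order articles by their best match against any of the user's profiles."""
    def best(name):
        scores = (match(u, article_profiles[name]) for u in user_profiles)
        return max(scores, default=0.0)
    return sorted(article_profiles, key=lambda name: (-best(name), name))

# One user with two topics of interest.
user_profiles = [{"senate": 1.0, "election": 0.5}, {"baseball": 1.0}]
article_profiles = {
    "budget_vote": {"senate": 0.7, "budget": 0.7},
    "world_series": {"baseball": 0.9, "inning": 0.3},
    "recipe": {"flour": 1.0},
}
print(rank_articles(user_profiles, article_profiles))
```

Taking the maximum over profiles means an article needs to match only one of the user's interests to rank highly, rather than an average of all of them.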